ACM Home Page
Please provide us with feedback. Feedback
On-line spam filter fusion
Full text PdfPdf (290 KB)
Source Annual ACM Conference on Research and Development in Information Retrieval archive
Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval table of contents
Seattle, Washington, USA
SESSION: Fusion and spam table of contents
Pages: 123 - 130  
Year of Publication: 2006
ISBN:1-59593-369-7
Authors
Thomas R. Lynam  University of Waterloo, Waterloo, Ontario, Canada
Gordon V. Cormack  University of Waterloo, Waterloo, Ontario, Canada
David R. Cheriton  University of Waterloo, Waterloo, Ontario, Canada
Sponsors
SIGIR: ACM Special Interest Group on Information Retrieval
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 15,   Downloads (12 Months): 131,   Citation Count: 8
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1148170.1148195
What is a DOI?

ABSTRACT

We show that a set of independently developed spam filters may be combined in simple ways to provide substantially better filtering than any of the individual filters. The results of fifty-three spam filters evaluated at the TREC 2005 Spam Track were combined post-hoc so as to simulate the parallel on-line operation of the filters. The combined results were evaluated using the TREC methodology, yielding more than a factor of two improvement over the best filter. The simplest method -- averaging the binary classifications returned by the individual filters -- yields a remarkably good result. A new method -- averaging log-odds estimates based on the scores returned by the individual filters -- yields a somewhat better result, and provides input to SVM- and logistic-regression-based stacking methods. The stacking methods appear to provide further improvement, but only for very large corpora. Of the stacking methods, logistic regression yields the better result. Finally, we show that it is possible to select a priori small subsets of the filters that, when combined, still outperform the best individual filter by a substantial margin.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
Attia, J. Moving beyond sensistivity and specificity: using likelihood ratios to help interpret diagnostic tests. Australian Prescriber 26, 5 (2003), 111--113.
 
2
 
3
 
4
5
 
6
Cormack, G. V., and Bratko, A. Batch and on-line spam filter evaluation. In CEAS 2006 -- The 3rd Conference on Email and Anti-Spam (Mountain View, 2006).
 
7
Cormack, G. V., and Lynam, T. R. Overview of the TREC 2005 Spam Evaluation Track. In Fourteenth Text REtrieval Conference (TREC-2005) (Gaithersburg, MD, 2005), NIST.
8
 
9
 
10
 
11
Fawcett, T. ROC graphs: Notes and practical considerations for researchers. Tech. Rep. HPL-2003-4, HP Laboratories, 2004.
 
12
13
 
14
 
15
 
16
Komarek, P., and Moore, A. Fast robust logistic regression for large sparse datasets with binary outputs. In Artificial Intelligence and Statistics (2003).
17
18
 
19
Lynam, T., and Cormack, G. TREC Spam Filter Evaluation Took Kit. http://plg.uwaterloo.ca/~trlynam/spamjig.
20
21
 
22
Sakkis, G., Androutsopoulos, I., Paliouras, G., Karkaletsis, V., Spyropoulos, C. D., and Stamatopoulos, P. Stacking classifiers for anti-spam filtering of e-mail, 2001.
23
 
24
Segal, R., Crawford, J., Kephart, J., and Leiba, B. SpamGuru: An enterprise anti-spam filtering system. In First Conference on Email and Anti-Spam (CEAS) (2004).
 
25
Shaw, J. A., and Fox, E. A. Combination of multiple searches. In Text REtrieval Conference (1994).
 
26
Voorhees, E. Fourteenth Text REtrieval Conference (TREC-2005). NIST, Gaithersburg, MD, 2005.
 
27
28

CITED BY  8

Collaborative Colleagues:
Thomas R. Lynam: colleagues
Gordon V. Cormack: colleagues
David R. Cheriton: colleagues