On-line spam filter fusion
Source
Annual ACM Conference on Research and Development in Information Retrieval
Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Seattle, Washington, USA
SESSION: Fusion and spam
Pages: 123 - 130
Year of Publication: 2006
ISBN: 1-59593-369-7
ABSTRACT
We show that a set of independently developed spam filters may be combined in simple ways to provide substantially better filtering than any of the individual filters. The results of fifty-three spam filters evaluated at the TREC 2005 Spam Track were combined post-hoc so as to simulate the parallel on-line operation of the filters. The combined results were evaluated using the TREC methodology, yielding more than a factor of two improvement over the best filter. The simplest method -- averaging the binary classifications returned by the individual filters -- yields a remarkably good result. A new method -- averaging log-odds estimates based on the scores returned by the individual filters -- yields a somewhat better result, and provides input to SVM- and logistic-regression-based stacking methods. The stacking methods appear to provide further improvement, but only for very large corpora. Of the stacking methods, logistic regression yields the better result. Finally, we show that it is possible to select a priori small subsets of the filters that, when combined, still outperform the best individual filter by a substantial margin.
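The two simplest fusion rules named in the abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: the abstract does not say how a filter's raw score is turned into a log-odds estimate, so the sketch assumes each filter already supplies a spam probability. The function names and the clamping constant `eps` are hypothetical.

```python
import math

def vote_fusion(decisions):
    """Average binary classifications (1 = spam, 0 = ham) from several filters."""
    return sum(decisions) / len(decisions)

def log_odds(p, eps=1e-6):
    """Log-odds of a probability, clamped away from 0 and 1 so it stays finite.

    Assumption: p is a per-filter spam probability. The paper derives its
    estimates from filter scores by a method not given in the abstract.
    """
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def log_odds_fusion(probabilities):
    """Average the per-filter log-odds estimates; a positive value leans spam."""
    return sum(log_odds(p) for p in probabilities) / len(probabilities)

# Three hypothetical filters judging one message.
decisions = [1, 1, 0]            # hard spam/ham verdicts
probabilities = [0.9, 0.8, 0.4]  # assumed spam probabilities

vote = vote_fusion(decisions)
fused = log_odds_fusion(probabilities)
print("vote says spam:", vote > 0.5)
print("log-odds says spam:", fused > 0.0)
```

In this sketch the fused log-odds value could also serve as an input feature to a stacking learner such as logistic regression, which is the role the abstract describes for it.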