Combining email models for false positive reduction
Source: International Conference on Knowledge Discovery and Data Mining
Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
Chicago, Illinois, USA
SESSION: Research track paper
Pages: 98 - 107  
Year of Publication: 2005
ISBN:1-59593-135-X
Authors
Shlomo Hershkop  Columbia University, New York, NY
Salvatore J. Stolfo  Columbia University, New York, NY
Sponsors
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1081870.1081885

ABSTRACT

Machine learning and data mining can be effectively used to model, classify and discover interesting information in a wide variety of data, including email. The Email Mining Toolkit, EMT, has been designed to provide a wide range of analyses for arbitrary email sources. Depending upon the task, one can usually achieve very high accuracy, but at the cost of some false positives. In practice, false positives are generally prohibitively expensive. In spam detection, for example, misclassifying even a single email may be unacceptable if that email is very important. Much work has been done to improve specific algorithms for the task of detecting unwanted messages, but less work has been reported on leveraging multiple algorithms and correlating models in this particular domain of email analysis. EMT has been updated with new correlation functions that allow the analyst to integrate a number of EMT's user behavior models available in the core technology. We present results of combining classifier outputs to both improve accuracy and reduce false positives for the problem of spam detection. We apply these methods to a very large email data set and show the results of different combination methods on this corpus. We introduce a new method to compare multiple and combined classifiers, and show how it differs from past work. The method analyzes the relative gain and maximum possible accuracy that can be achieved for certain combinations of classifiers in order to automatically choose the best combination.
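
The abstract does not give formulas for the combination methods, the "relative gain", or the "maximum possible accuracy", so the following Python sketch is only an illustration under stated assumptions, not the authors' method. It assumes binary labels (1 = spam, 0 = legitimate), uses a simple majority vote as a stand-in combination function, takes the maximum possible accuracy to be the fraction of messages that at least one classifier labels correctly, and defines relative gain as the share of the remaining headroom over the best single classifier that the combination captures. All function names and the toy data are hypothetical.

```python
from typing import List, Sequence


def accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of messages whose predicted label matches the true label."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def false_positive_rate(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of legitimate messages (label 0) that were flagged as spam (1)."""
    on_legit = [p for p, y in zip(preds, labels) if y == 0]
    return sum(on_legit) / len(on_legit) if on_legit else 0.0


def majority_vote(all_preds: Sequence[Sequence[int]]) -> List[int]:
    """Assumed combination rule: spam only if more than half the classifiers say so."""
    n = len(all_preds)
    return [1 if 2 * sum(votes) > n else 0 for votes in zip(*all_preds)]


def max_possible_accuracy(all_preds: Sequence[Sequence[int]],
                          labels: Sequence[int]) -> float:
    """Assumed upper bound: a message counts if ANY classifier got it right."""
    hits = sum(any(p == y for p in votes)
               for votes, y in zip(zip(*all_preds), labels))
    return hits / len(labels)


def relative_gain(combined_acc: float, best_single_acc: float,
                  upper_bound: float) -> float:
    """Assumed definition: share of the available headroom the combination captures."""
    headroom = upper_bound - best_single_acc
    return (combined_acc - best_single_acc) / headroom if headroom > 0 else 0.0


# Toy data (hypothetical): three classifiers that each err on different messages.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
clf_a = [1, 1, 0, 0, 0, 1, 1, 0]
clf_b = [1, 0, 1, 0, 1, 0, 1, 0]
clf_c = [0, 1, 1, 1, 0, 0, 1, 0]
models = [clf_a, clf_b, clf_c]

combined = majority_vote(models)
best_single = max(accuracy(m, labels) for m in models)
bound = max_possible_accuracy(models, labels)

print("best single accuracy:", best_single)
print("combined accuracy:", accuracy(combined, labels))
print("combined false positive rate:", false_positive_rate(combined, labels))
print("maximum possible accuracy:", bound)
print("relative gain:", relative_gain(accuracy(combined, labels), best_single, bound))
```

In this toy case the classifiers make errors on disjoint messages, so voting removes every individual false positive. When classifiers tend to fail on the same messages, the upper bound and the achievable gain both shrink, which is the kind of comparison the abstract describes using to choose among combinations.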



