ABSTRACT
Machine learning and data mining can be used effectively to model, classify, and discover interesting information in a wide variety of data, including email. The Email Mining Toolkit (EMT) has been designed to provide a wide range of analyses for arbitrary email sources. Depending on the task, one can usually achieve very high accuracy, but at the cost of some false positives. In the real world, false positives are generally prohibitively expensive. In spam detection, for example, misclassifying even a single email may be unacceptable if that email is very important. Much work has been done to improve specific algorithms for detecting unwanted messages, but less work has been reported on leveraging multiple algorithms and correlating models in this particular domain of email analysis. EMT has been updated with new correlation functions that allow the analyst to integrate a number of the user behavior models available in EMT's core technology. We present results of combining classifier outputs to both improve accuracy and reduce false positives for the problem of spam detection. We apply these methods to a very large email data set and show the results of different combination methods on this corpus. We introduce a new method for comparing multiple and combined classifiers, and show how it differs from past work. The method analyzes the relative gain and the maximum possible accuracy that can be achieved for certain combinations of classifiers in order to choose the best combination automatically.
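The idea of comparing classifier combinations by relative gain and maximum possible accuracy can be illustrated with a small sketch. The abstract does not specify the combination rules or the exact definitions used in EMT, so everything below is an assumption made for illustration: the combiner is a plain majority vote, "gain" is the combined accuracy minus the best member's accuracy, "maximum possible accuracy" is taken to be the fraction of messages that at least one member classifies correctly, and all function names and toy predictions are hypothetical.

```python
from itertools import combinations

# Labels and predictions use 1 = spam, 0 = not spam.

def accuracy(preds, labels):
    """Fraction of messages whose predicted label matches the true label."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def false_positive_rate(preds, labels):
    """Fraction of non-spam messages that were flagged as spam."""
    ham_preds = [p for p, y in zip(preds, labels) if y == 0]
    return sum(ham_preds) / len(ham_preds) if ham_preds else 0.0

def majority_vote(pred_lists):
    """Assumed combiner: spam only if a strict majority of members say spam.
    Ties fall to non-spam, which favors fewer false positives."""
    n = len(pred_lists)
    return [1 if 2 * sum(votes) > n else 0 for votes in zip(*pred_lists)]

def oracle_accuracy(pred_lists, labels):
    """Assumed upper bound: a message counts as recoverable if at least
    one member of the combination labels it correctly."""
    hits = sum(any(p == y for p in votes)
               for votes, y in zip(zip(*pred_lists), labels))
    return hits / len(labels)

def rank_combinations(named_preds, labels, size=2):
    """Score every combination of `size` classifiers, sorted by gain."""
    rows = []
    for names in combinations(sorted(named_preds), size):
        members = [named_preds[name] for name in names]
        combined = majority_vote(members)
        combined_acc = accuracy(combined, labels)
        best_single = max(accuracy(m, labels) for m in members)
        rows.append({
            "names": names,
            "accuracy": combined_acc,
            "gain": combined_acc - best_single,
            "max_possible": oracle_accuracy(members, labels),
            "fpr": false_positive_rate(combined, labels),
        })
    return sorted(rows, key=lambda r: r["gain"], reverse=True)

# Toy data (hypothetical): three classifiers that err on different messages.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
toy_preds = {
    "a": [1, 1, 0, 0, 0, 1, 1, 0],
    "b": [1, 0, 1, 0, 1, 0, 1, 0],
    "c": [0, 1, 1, 0, 0, 0, 0, 0],
}

if __name__ == "__main__":
    for row in rank_combinations(toy_preds, labels, size=3):
        print(row)
```

In this toy setup the three classifiers make errors on disjoint messages, so voting corrects all of them. When members tend to fail on the same messages, the oracle bound drops, which signals that the combination has little headroom regardless of the combination rule.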