An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages
Source: Annual ACM Conference on Research and Development in Information Retrieval
Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
Athens, Greece
Pages: 160 - 167  
Year of Publication: 2000
ISBN: 1-58113-226-3
Authors
Ion Androutsopoulos  Software and Knowledge Engineering Laboratory, Institute of Informatics and Telecommunications, National Centre for Scientific Research 'Demokritos', 153 10 Ag. Paraskevi, Athens, Greece
John Koutsias  Software and Knowledge Engineering Laboratory, Institute of Informatics and Telecommunications, National Centre for Scientific Research 'Demokritos', 153 10 Ag. Paraskevi, Athens, Greece
Konstantinos V. Chandrinos  Software and Knowledge Engineering Laboratory, Institute of Informatics and Telecommunications, National Centre for Scientific Research 'Demokritos', 153 10 Ag. Paraskevi, Athens, Greece
Constantine D. Spyropoulos  Software and Knowledge Engineering Laboratory, Institute of Informatics and Telecommunications, National Centre for Scientific Research 'Demokritos', 153 10 Ag. Paraskevi, Athens, Greece
Sponsors
Athens U of Econ & Business : Athens University of Economics and Business
Greek Com Soc : Greek Computer Society
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/345508.345569

ABSTRACT

The growing problem of unsolicited bulk e-mail, also known as “spam”, has generated a need for reliable anti-spam e-mail filters. Filters of this type have so far been based mostly on manually constructed keyword patterns. An alternative approach has recently been proposed, whereby a Naive Bayesian classifier is trained automatically to detect spam messages. We test this approach on a large collection of personal e-mail messages, which we make publicly available in “encrypted” form, contributing towards standard benchmarks. We introduce appropriate cost-sensitive measures and, at the same time, investigate the effect of attribute-set size, training-corpus size, lemmatization, and stop lists, issues that have not been explored in previous experiments. Finally, the Naive Bayesian filter is compared, in terms of performance, to a keyword-pattern filter that is part of a widely used e-mail reader.
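
As a rough illustration of the kind of filter the abstract describes, the sketch below trains a small Naive Bayesian classifier on tokenized messages and applies a cost-sensitive decision threshold. It is not the authors' implementation. The abstract does not give the model details, so everything here is an assumption: binary word-presence attributes, add-one smoothing, the function names (`train`, `spam_probability`, `is_spam`), the `cost_ratio` parameter, the decision rule "flag as spam only if P(spam | message) exceeds cost_ratio / (1 + cost_ratio)", and the toy training messages.

```python
import math
from collections import Counter


def train(messages):
    """Fit a Bernoulli Naive Bayes model.

    messages: iterable of (tokens, is_spam) pairs (hypothetical input format).
    """
    doc_counts = {True: 0, False: 0}                 # messages per class
    word_docs = {True: Counter(), False: Counter()}  # messages containing each word
    vocab = set()
    for tokens, label in messages:
        present = set(tokens)
        doc_counts[label] += 1
        word_docs[label].update(present)
        vocab |= present
    return {"doc_counts": doc_counts, "word_docs": word_docs, "vocab": vocab}


def spam_probability(model, tokens):
    """Return P(spam | message) under the word-independence assumption."""
    present = set(tokens)
    total = sum(model["doc_counts"].values())
    log_score = {}
    for label in (True, False):
        n = model["doc_counts"][label]
        score = math.log((n + 1) / (total + 2))      # smoothed class prior
        for word in model["vocab"]:
            # Smoothed probability that a message of this class contains the word.
            p = (model["word_docs"][label][word] + 1) / (n + 2)
            score += math.log(p if word in present else 1.0 - p)
        log_score[label] = score
    # Normalize in log space to avoid underflow on long vocabularies.
    peak = max(log_score.values())
    spam = math.exp(log_score[True] - peak)
    legit = math.exp(log_score[False] - peak)
    return spam / (spam + legit)


def is_spam(model, tokens, cost_ratio=1.0):
    """Flag a message as spam only when the evidence outweighs the cost.

    cost_ratio (an assumed parameter): how many times worse it is to block
    a legitimate message than to let a spam message through.
    """
    threshold = cost_ratio / (1.0 + cost_ratio)
    return spam_probability(model, tokens) > threshold


# Toy data, invented for illustration only.
training = [
    (["win", "money", "now"], True),
    (["free", "money"], True),
    (["meeting", "tomorrow"], False),
    (["project", "meeting", "notes"], False),
]
model = train(training)
print(is_spam(model, ["free", "money"]))                  # equal costs
print(is_spam(model, ["free", "money"], cost_ratio=999))  # blocking is very costly
```

Raising `cost_ratio` makes the filter more conservative: the same message needs a higher spam probability before it is blocked, which reflects the idea that losing a legitimate message is usually worse than receiving a spam one.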


