An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages
Source
Annual ACM Conference on Research and Development in Information Retrieval
Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
Athens, Greece
Pages: 160-167
Year of Publication: 2000
ISBN: 1-58113-226-3
Authors
Ion Androutsopoulos
Software and Knowledge Engineering Laboratory, Institute of Informatics and Telecommunications, National Centre for Scientific Research 'Demokritos', 153 10 Ag. Paraskevi, Athens, Greece
John Koutsias
Software and Knowledge Engineering Laboratory, Institute of Informatics and Telecommunications, National Centre for Scientific Research 'Demokritos', 153 10 Ag. Paraskevi, Athens, Greece
Konstantinos V. Chandrinos
Software and Knowledge Engineering Laboratory, Institute of Informatics and Telecommunications, National Centre for Scientific Research 'Demokritos', 153 10 Ag. Paraskevi, Athens, Greece
Constantine D. Spyropoulos
Software and Knowledge Engineering Laboratory, Institute of Informatics and Telecommunications, National Centre for Scientific Research 'Demokritos', 153 10 Ag. Paraskevi, Athens, Greece
ABSTRACT
The growing problem of unsolicited bulk e-mail, also known as “spam”, has generated a need for reliable anti-spam e-mail filters. Filters of this type have so far been based mostly on manually constructed keyword patterns. An alternative approach has recently been proposed, whereby a Naive Bayesian classifier is trained automatically to detect spam messages. We test this approach on a large collection of personal e-mail messages, which we make publicly available in “encrypted” form, contributing towards standard benchmarks. We introduce appropriate cost-sensitive measures and investigate the effect of attribute-set size, training-corpus size, lemmatization, and stop lists, issues that have not been explored in previous experiments. Finally, the Naive Bayesian filter is compared, in terms of performance, to a filter that uses keyword patterns and is part of a widely used e-mail reader.
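The abstract names the technique but gives no implementation details, so the following is only a minimal sketch of how a cost-sensitive Naive Bayesian spam filter might look. It assumes binary word-presence attributes, Laplace smoothing, and a decision threshold of `lam / (1 + lam)`, where `lam` is a hypothetical cost ratio saying how much worse it is to block a legitimate message than to pass a spam one. The function names, the toy messages, and the three-word vocabulary are all illustrative and are not taken from the paper or its corpus.

```python
import math
from collections import Counter

CLASSES = ("spam", "legit")

def train(messages, labels, vocabulary):
    """Estimate class priors and P(word present | class) with Laplace smoothing.

    Assumes every class in CLASSES occurs at least once in `labels`.
    """
    doc_counts = Counter(labels)
    word_counts = {c: Counter() for c in CLASSES}
    for text, label in zip(messages, labels):
        # Binary attributes: a word counts once per message, however often it occurs.
        word_counts[label].update(set(text.lower().split()) & vocabulary)
    priors = {c: doc_counts[c] / len(labels) for c in CLASSES}
    likelihoods = {
        c: {w: (word_counts[c][w] + 1) / (doc_counts[c] + 2) for w in vocabulary}
        for c in CLASSES
    }
    return priors, likelihoods

def spam_probability(text, priors, likelihoods, vocabulary):
    """Return P(spam | message) under the naive independence assumption."""
    present = set(text.lower().split())
    log_scores = {}
    for c in CLASSES:
        score = math.log(priors[c])
        for w in vocabulary:
            p = likelihoods[c][w]
            score += math.log(p if w in present else 1.0 - p)
        log_scores[c] = score
    # Normalise in log space to avoid underflow with large vocabularies.
    top = max(log_scores.values())
    weights = {c: math.exp(s - top) for c, s in log_scores.items()}
    return weights["spam"] / sum(weights.values())

def is_spam(text, priors, likelihoods, vocabulary, lam=9.0):
    """Flag a message only if P(spam | message) exceeds lam / (1 + lam)."""
    return spam_probability(text, priors, likelihoods, vocabulary) > lam / (1.0 + lam)

# Toy training data (invented for illustration only).
messages = [
    "win free money now",
    "free prize claim now",
    "meeting agenda for monday",
    "lunch on monday",
]
labels = ["spam", "spam", "legit", "legit"]
vocabulary = {"free", "now", "monday"}

priors, likelihoods = train(messages, labels, vocabulary)
print(is_spam("free money now", priors, likelihoods, vocabulary, lam=9.0))
print(is_spam("free money now", priors, likelihoods, vocabulary, lam=999.0))
```

Raising `lam` makes the filter more conservative: the same message can be flagged under a moderate cost ratio and let through under a very strict one, which is the kind of trade-off a cost-sensitive evaluation is meant to expose.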
CITED BY 35
Shilad Sen, Werner Geyer, Michael Muller, Marty Moore, Beth Brownholtz, Eric Wilcox, David R. Millen, FeedMe: a collaborative alert filtering system, Proceedings of the 2006 20th anniversary conference on Computer supported cooperative work, November 04-08, 2006, Banff, Alberta, Canada
Saeed Abu-Nimeh, Dario Nappa, Xinlei Wang, Suku Nair, A comparison of machine learning techniques for phishing detection, Proceedings of the anti-phishing working groups 2nd annual eCrime researchers summit, p.60-69, October 04-05, 2007, Pittsburgh, Pennsylvania
Shan-Hung Wu, Keng-Pei Lin, Chung-Min Chen, Ming-Syan Chen, Asymmetric support vector machines: low false-positive learning under the user tolerance, Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, August 24-27, 2008, Las Vegas, Nevada, USA