Raising the baseline for high-precision text classifiers
Source
International Conference on Knowledge Discovery and Data Mining
Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
San Jose, California, USA
SESSION: Research track papers
Pages: 400 - 409  
Year of Publication: 2007
ISBN: 978-1-59593-609-7
Authors
Aleksander Kolcz  Microsoft
Wen-tau Yih  Microsoft
Sponsors
ACM: Association for Computing Machinery
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 19,   Downloads (12 Months): 107,   Citation Count: 2
DOI: http://doi.acm.org/10.1145/1281192.1281237

ABSTRACT

Many important application areas of text classifiers demand high precision, and it is common to compare prospective solutions to the performance of Naive Bayes. This baseline is usually easy to improve upon, but in this work we demonstrate that appropriate document representation can make outperforming this classifier much more challenging. Most importantly, we provide a link between Naive Bayes and the logarithmic opinion pooling of the mixture-of-experts framework, which dictates a particular type of document length normalization. Motivated by document-specific feature selection, we propose monotonic constraints on document term weighting, which are shown to be an effective method of fine-tuning document representation. The discussion is supported by experiments using three large email corpora corresponding to the problem of spam detection, where high precision is of particular importance.
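
The abstract does not spell out the normalization it refers to. As a rough illustration only, the sketch below assumes a multinomial Naive Bayes log-ratio score in which term counts are divided by document length, so that the per-term weights sum to one in the way the weights of a logarithmic opinion pool do. The function names, the Laplace smoothing constant `alpha`, the omission of the class prior, and the toy data are all assumptions made for this example and are not taken from the paper.

```python
import math
from collections import Counter


def train_log_ratios(spam_docs, ham_docs, alpha=1.0):
    """Per-term log-likelihood ratios log p(t|spam) - log p(t|ham).

    Multinomial event model with Laplace smoothing (alpha is an assumed default).
    Each document is a list of tokens.
    """
    spam_counts = Counter(t for d in spam_docs for t in d)
    ham_counts = Counter(t for d in ham_docs for t in d)
    vocab = set(spam_counts) | set(ham_counts)
    spam_total = sum(spam_counts.values()) + alpha * len(vocab)
    ham_total = sum(ham_counts.values()) + alpha * len(vocab)
    return {
        t: math.log((spam_counts[t] + alpha) / spam_total)
        - math.log((ham_counts[t] + alpha) / ham_total)
        for t in vocab
    }


def score(doc, log_ratios, normalize=True):
    """Weighted sum of per-term log ratios; the class prior is left out.

    With normalize=True each term is weighted by count / document length,
    so the weights sum to one (the assumed pooling-style normalization).
    With normalize=False this is the usual unnormalized Naive Bayes sum.
    Terms outside the training vocabulary are skipped.
    """
    counts = Counter(t for t in doc if t in log_ratios)
    length = sum(counts.values())
    if length == 0:
        return 0.0
    scale = 1.0 / length if normalize else 1.0
    return sum(scale * n * log_ratios[t] for t, n in counts.items())


# Toy data, invented for illustration.
spam = [["buy", "cheap", "pills"], ["cheap", "pills", "now"]]
ham = [["meeting", "at", "noon"], ["lunch", "at", "noon"]]
ratios = train_log_ratios(spam, ham)
short_doc = ["cheap", "pills"]
long_doc = short_doc * 3  # same content repeated three times
```

Under these assumptions, `score(short_doc, ratios)` and `score(long_doc, ratios)` agree, whereas with `normalize=False` the score of the repeated document grows in proportion to its length. This is one way to see why length normalization matters when a fixed decision threshold is tuned for high precision.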


