ABSTRACT
Many important application areas of text classifiers demand high precision, and it is common to compare prospective solutions to the performance of Naive Bayes. This baseline is usually easy to improve upon, but in this work we demonstrate that appropriate document representation can make outperforming this classifier much more challenging. Most importantly, we provide a link between Naive Bayes and the logarithmic opinion pooling of the mixture-of-experts framework, which dictates a particular type of document length normalization. Motivated by document-specific feature selection, we propose monotonic constraints on document term weighting, which are shown to be an effective method of fine-tuning document representation. The discussion is supported by experiments using three large email corpora corresponding to the problem of spam detection, where high precision is of particular importance.
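The abstract does not state the normalization formula, so the following is only a minimal sketch of one common reading of logarithmic opinion pooling: each token acts as an "expert" whose log-odds vote is weighted, and the weights sum to one. With uniform weights this amounts to dividing the multinomial Naive Bayes score by document length. The function names (`train_log_odds`, `nb_score`, `pooled_score`), the Laplace smoothing parameter `alpha`, and the toy data are assumptions made for illustration, not the paper's implementation.

```python
import math
from collections import Counter


def train_log_odds(spam_docs, ham_docs, alpha=1.0):
    """Per-term log-odds log P(t|spam) - log P(t|ham), Laplace-smoothed."""
    spam_counts = Counter(t for d in spam_docs for t in d)
    ham_counts = Counter(t for d in ham_docs for t in d)
    vocab = set(spam_counts) | set(ham_counts)
    spam_total = sum(spam_counts.values()) + alpha * len(vocab)
    ham_total = sum(ham_counts.values()) + alpha * len(vocab)
    return {
        t: math.log((spam_counts[t] + alpha) / spam_total)
        - math.log((ham_counts[t] + alpha) / ham_total)
        for t in vocab
    }


def nb_score(doc, log_odds, prior_log_odds=0.0):
    """Standard multinomial Naive Bayes: every token casts a full vote."""
    return prior_log_odds + sum(log_odds.get(t, 0.0) for t in doc)


def pooled_score(doc, log_odds, prior_log_odds=0.0):
    """Assumed log opinion pool: token votes weighted by 1/len(doc)."""
    if not doc:
        return prior_log_odds
    return prior_log_odds + sum(log_odds.get(t, 0.0) for t in doc) / len(doc)


# Toy corpora (hypothetical), tokenized into word lists.
spam_docs = [["buy", "cheap", "pills"], ["cheap", "offer"]]
ham_docs = [["meeting", "agenda"], ["project", "offer", "review"]]
log_odds = train_log_odds(spam_docs, ham_docs)

doc = ["cheap", "offer"]
print(nb_score(doc, log_odds), nb_score(doc * 3, log_odds))
print(pooled_score(doc, log_odds), pooled_score(doc * 3, log_odds))
```

Under this assumption, repeating a document three times triples the unnormalized score but leaves the pooled score unchanged, which is the sense in which the pooling view would impose length normalization.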