Partitioned logistic regression for spam filtering
Source
International Conference on Knowledge Discovery and Data Mining
Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Las Vegas, Nevada, USA
Session: Research papers
Pages 97-105
Year of Publication: 2008
ISBN: 978-1-60558-193-4
Authors
Ming-wei Chang, University of Illinois at Urbana-Champaign, Urbana, IL, USA
Wen-tau Yih, Microsoft Research, Redmond, WA, USA
Christopher Meek, Microsoft Research, Redmond, WA, USA
Sponsors
ACM: Association for Computing Machinery
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1401890.1401907

ABSTRACT

Naive Bayes and logistic regression perform well in different regimes. The former is a very simple generative model that is efficient to train and performs well empirically in many applications, while the latter is a discriminative model that often achieves better accuracy and can be shown to outperform naive Bayes asymptotically. In this paper, we propose a novel hybrid model, partitioned logistic regression, which has several advantages over both naive Bayes and logistic regression. This model separates the original feature space into several disjoint feature groups. Individual models on these groups of features are learned using logistic regression, and their predictions are combined using the naive Bayes principle to produce a robust final estimation. We show that our model is better both theoretically and empirically. In addition, when applied to a practical problem, email spam filtering, it improves the normalized AUC score at 10% false-positive rate by 28.8% and 23.6% compared to naive Bayes and logistic regression, respectively, when using exactly the same training examples.
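
The combination step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes binary labels, a hand-rolled gradient-descent logistic regression, and the standard naive Bayes identity that, if feature groups are conditionally independent given the class, the combined log-odds equal the sum of the per-group log-odds minus (K - 1) times the prior log-odds, where K is the number of groups. The function names (fit_logistic, fit_partitioned, predict_proba), the hyperparameters, and the synthetic data are all hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic(X, y, lr=0.5, steps=2000, l2=1e-3):
    """Batch gradient descent for binary logistic regression (illustrative only)."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)
        w -= lr * (X.T @ (p - y) / n + l2 * w)  # L2-regularized weight gradient
        b -= lr * np.mean(p - y)                # bias gradient
    return w, b


def fit_partitioned(X, y, groups):
    """Train one logistic regression per disjoint feature group."""
    prior = np.clip(np.mean(y), 1e-6, 1 - 1e-6)
    prior_logit = np.log(prior / (1.0 - prior))
    models = [fit_logistic(X[:, cols], y) for cols in groups]
    return {"groups": groups, "models": models, "prior_logit": prior_logit}


def predict_proba(model, X):
    """Combine per-group posteriors with the naive Bayes rule in log-odds space."""
    k = len(model["groups"])
    # Each group's log-odds is its linear score X_g @ w + b.
    group_logits = sum(
        X[:, cols] @ w + b
        for cols, (w, b) in zip(model["groups"], model["models"])
    )
    # Subtract the prior log-odds that were counted k - 1 extra times.
    return sigmoid(group_logits - (k - 1) * model["prior_logit"])


# Synthetic two-class data: four features, split into two assumed groups.
rng = np.random.default_rng(0)
y = (rng.random(400) < 0.5).astype(float)
X = rng.normal(size=(400, 4)) + y[:, None] * 1.5
model = fit_partitioned(X, y, groups=[[0, 1], [2, 3]])
proba = predict_proba(model, X)
print("training accuracy:", np.mean((proba > 0.5) == (y == 1)))
```

With a single group containing every feature, the combination reduces to ordinary logistic regression; with one feature per group, it behaves like a naive Bayes classifier whose per-feature terms are fit discriminatively. How the groups are chosen for spam filtering is not stated in the abstract, so the grouping above is arbitrary.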



