ABSTRACT
Naive Bayes and logistic regression perform well in different regimes. While the former is a very simple generative model which is efficient to train and performs well empirically in many applications, the latter is a discriminative model which often achieves better accuracy and can be shown to outperform naive Bayes asymptotically. In this paper, we propose a novel hybrid model, partitioned logistic regression, which has several advantages over both naive Bayes and logistic regression. This model separates the original feature space into several disjoint feature groups. Individual models on these groups of features are learned using logistic regression, and their predictions are combined using the naive Bayes principle to produce a robust final estimate. We show that our model is better both theoretically and empirically. In addition, when applied to a practical problem, email spam filtering, it improves the normalized AUC score at a 10% false-positive rate by 28.8% and 23.6% compared to naive Bayes and logistic regression, respectively, when using the exact same training examples.
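The combination step described above can be illustrated with a minimal sketch. It assumes that the feature groups are conditionally independent given the class, in which case the naive Bayes principle gives logit P(y=1|x) = sum_k logit P_k(y=1|x_k) - (K-1) logit P(y=1), where P_k is the logistic regression model trained on group k and K is the number of groups. The function names, the gradient-descent trainer, its hyperparameters, and the toy data are illustrative assumptions and are not taken from the paper.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logreg(X, y, lr=0.1, epochs=200, l2=1e-3):
    """Fit one logistic regression by batch gradient descent (illustrative trainer)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        # Averaged gradient step with L2 shrinkage on the weights only.
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def log_odds(model, x):
    # Linear score of a fitted model, i.e. logit P_k(y=1 | x_k).
    w, b = model
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def fit_partitioned(X, y, groups):
    """Train one model per disjoint feature group; groups holds column indices."""
    prior = sum(y) / len(y)  # assumes both classes occur in y
    prior_logit = math.log(prior / (1.0 - prior))
    models = [train_logreg([[row[j] for j in g] for row in X], y) for g in groups]
    return models, prior_logit


def predict_proba(fitted, groups, x):
    """Combine per-group log-odds under the conditional independence assumption."""
    models, prior_logit = fitted
    total = sum(log_odds(m, [x[j] for j in g]) for m, g in zip(models, groups))
    # The class prior is counted once per group, so subtract the K-1 extra copies.
    return sigmoid(total - (len(groups) - 1) * prior_logit)


# Toy data (hypothetical): four features split into two disjoint groups.
X = [
    [1.0, 0.8, 1.2, 0.9], [0.9, 1.1, 0.7, 1.0], [1.2, 0.9, 1.1, 1.3],
    [-1.0, -0.8, -1.1, -0.9], [-0.9, -1.2, -0.8, -1.0], [-1.1, -1.0, -1.2, -0.7],
]
y = [1, 1, 1, 0, 0, 0]
groups = [[0, 1], [2, 3]]

fitted = fit_partitioned(X, y, groups)
p_pos = predict_proba(fitted, groups, [1.0, 1.0, 1.0, 1.0])
p_neg = predict_proba(fitted, groups, [-1.0, -1.0, -1.0, -1.0])
```

With a single group the prior correction vanishes and the sketch reduces to ordinary logistic regression. With one feature per group, each sub-model estimates P(y|x_j) for a single feature and the combination takes the familiar naive Bayes form.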