Genre-based decomposition of email class noise
Source
International Conference on Knowledge Discovery and Data Mining
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Paris, France
SESSION: Research track papers
Pages 427-436
Year of Publication: 2009
ISBN: 978-1-60558-495-9
ABSTRACT
Corruption of data by class-label noise is an important practical concern affecting many classification problems. Studies of data cleaning techniques, however, often assume a uniform label noise model, which is seldom realized in practice. Relatively little is understood about how the natural label noise distribution can be measured or simulated. Using email spam-filtering data, we demonstrate that class noise can have substantial content-specific bias. We also demonstrate that noise detection techniques based on classifier confidence tend to identify instances that human assessors are likely to label in error. We show that genre modeling can be very informative in identifying potential areas of mislabeling. Moreover, we show that genre decomposition can also be used to substantially improve spam filtering accuracy, with our results outperforming the best published figures for the trec05-p1 and ceas-2008 benchmark collections.
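The contrast between a uniform label noise model and content-specific noise can be illustrated with a small simulation. The sketch below is not the paper's method: the function names, the genre names, the flip probabilities, and the confidence threshold are all hypothetical, chosen only to show how per-genre flip rates differ from a single uniform rate and how a generic confidence-based check might flag suspect labels.

```python
import random


def flip_labels(labels, genres, flip_prob, seed=0):
    """Flip binary labels (0 = ham, 1 = spam) with a per-genre probability.

    flip_prob maps each genre to its flip probability. A uniform noise
    model is the special case where every genre has the same value.
    """
    rng = random.Random(seed)
    noisy = []
    for y, g in zip(labels, genres):
        noisy.append(1 - y if rng.random() < flip_prob[g] else y)
    return noisy


def noise_rate_by_genre(clean, noisy, genres):
    """Fraction of flipped labels within each genre."""
    totals, flips = {}, {}
    for y, z, g in zip(clean, noisy, genres):
        totals[g] = totals.get(g, 0) + 1
        flips[g] = flips.get(g, 0) + (1 if y != z else 0)
    return {g: flips[g] / totals[g] for g in totals}


def flag_suspects(labels, spam_probs, threshold=0.1):
    """Return indices whose assigned label gets low classifier confidence.

    spam_probs[i] is an (assumed) classifier estimate of P(spam) for
    message i. The confidence in the assigned label is P(spam) for spam
    labels and 1 - P(spam) for ham labels.
    """
    suspects = []
    for i, (y, p) in enumerate(zip(labels, spam_probs)):
        confidence = p if y == 1 else 1.0 - p
        if confidence < threshold:
            suspects.append(i)
    return suspects


if __name__ == "__main__":
    # Hypothetical genres and flip rates, for illustration only.
    genres = ["newsletter", "personal", "advertisement"] * 1000
    clean = [1 if g == "advertisement" else 0 for g in genres]
    biased = {"newsletter": 0.15, "personal": 0.01, "advertisement": 0.05}
    noisy = flip_labels(clean, genres, biased)
    print(noise_rate_by_genre(clean, noisy, genres))
```

Grouping the observed flips by genre is one simple way to check whether a noise process departs from the uniform assumption. The confidence check depends entirely on the quality of the classifier that supplies `spam_probs`.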