Genre-based decomposition of email class noise
Full text: Mov (26:24), Pdf (451 KB)
Source
International Conference on Knowledge Discovery and Data Mining
Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Paris, France
SESSION: Research track papers
Pages 427-436
Year of Publication: 2009
ISBN: 978-1-60558-495-9
Authors
Aleksander Kolcz  Microsoft, Redmond, USA
Gordon V. Cormack  University of Waterloo, Waterloo, Canada
Sponsors
ACM: Association for Computing Machinery
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 27,   Downloads (12 Months): 85,   Citation Count: 0
DOI: http://doi.acm.org/10.1145/1557019.1557070

ABSTRACT

Corruption of data by class-label noise is an important practical concern affecting many classification problems. Studies of data cleaning techniques often assume a uniform label noise model, however, which is seldom realized in practice. Relatively little is understood about how the natural label noise distribution can be measured or simulated. Using email spam-filtering data, we demonstrate that class noise can have substantial content-specific bias. We also demonstrate that noise detection techniques based on classifier confidence tend to identify instances that human assessors are likely to label in error. We show that genre modeling can be very informative in identifying potential areas of mislabeling. Moreover, we show that genre decomposition can also be used to substantially improve spam filtering accuracy, with our results outperforming the best published figures for the trec05-p1 and ceas-2008 benchmark collections.
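
To make two ideas from the abstract concrete, the following is a minimal Python sketch, not the authors' method. It assumes binary labels (0 = ham, 1 = spam) and a hypothetical list of classifier spam probabilities; the function names, the `margin` threshold, and all data values are illustrative assumptions. The first function simulates the uniform label noise model that cleaning studies often assume; the second flags instances whose label contradicts a confident classifier score, which is one simple form of confidence-based noise detection.

```python
import random


def flip_labels_uniform(labels, noise_rate, seed=0):
    """Uniform noise model: flip each binary label independently
    with probability noise_rate, regardless of message content."""
    rng = random.Random(seed)
    return [1 - y if rng.random() < noise_rate else y for y in labels]


def flag_suspect_labels(scores, labels, margin=0.4):
    """Return indices where the label contradicts a confident score.

    scores are assumed spam probabilities in [0, 1]. An instance is
    flagged when its score lies at least `margin` away from 0.5 on
    the side opposite to its label.
    """
    suspects = []
    for i, (p, y) in enumerate(zip(scores, labels)):
        confident_spam = p >= 0.5 + margin
        confident_ham = p <= 0.5 - margin
        if (y == 0 and confident_spam) or (y == 1 and confident_ham):
            suspects.append(i)
    return suspects


# Made-up data: clean labels and scores from a hypothetical classifier
# that is confident and correct on every message.
clean = [1, 1, 0, 0, 1, 0]
scores = [0.97, 0.97, 0.03, 0.03, 0.97, 0.03]

noisy = flip_labels_uniform(clean, noise_rate=0.3, seed=1)
suspects = flag_suspect_labels(scores, noisy)
```

Because the hypothetical classifier here is confident and correct on every message, the flagged indices coincide with the flipped labels. With a real filter the scores would be less clean, and, as the abstract notes, natural label noise is seldom uniform, so this simulation is only a baseline.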


