A pitfall and solution in multi-class feature selection for text classification
Source ACM International Conference Proceeding Series; Vol. 69 archive
Proceedings of the twenty-first international conference on Machine learning table of contents
Banff, Alberta, Canada
Page: 38  
Year of Publication: 2004
ISBN:1-58113-828-5
Author
George Forman  Hewlett-Packard Labs, Palo Alto, CA
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1015330.1015356

ABSTRACT

Information Gain is a well-known and empirically proven method for high-dimensional feature selection. We found that it and other existing methods failed to produce good results on an industrial text classification problem. Investigating the root cause, we found that a large class of feature scoring methods suffers from a pitfall: they can be blinded by a surplus of strongly predictive features for some classes while largely ignoring the features needed to discriminate difficult classes. In this paper we demonstrate that this pitfall hurts performance even on a relatively uniform text classification task. Based on this understanding, we present solutions inspired by round-robin scheduling that avoid the pitfall without resorting to costly wrapper methods. Empirical evaluation on 19 datasets shows substantial improvements.
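
The abstract does not spell out the algorithm, so the following is only a minimal sketch of what a round-robin style selection might look like, not the paper's actual method. It assumes binary word-presence features, scores each word separately for each class (one class versus the rest) with Information Gain, and then lets the classes take turns contributing their best not-yet-chosen word. The function names (`info_gain`, `per_class_rankings`, `round_robin_select`) and the alphabetical tie-breaking are hypothetical choices made for illustration.

```python
import math


def entropy(pos, neg):
    """Binary entropy (in bits) of a split with `pos` and `neg` counts."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def info_gain(tp, fp, fn, tn):
    """Information Gain of a binary feature for a binary (one-vs-rest) class.

    tp/fp: documents containing the word inside/outside the class.
    fn/tn: documents lacking the word inside/outside the class.
    """
    n = tp + fp + fn + tn
    present, absent = tp + fp, fn + tn
    return (entropy(tp + fn, fp + tn)
            - (present / n) * entropy(tp, fp)
            - (absent / n) * entropy(fn, tn))


def per_class_rankings(docs, labels):
    """Rank the vocabulary separately for each class, best word first.

    docs is a list of sets of words; labels is the parallel list of classes.
    """
    vocab = sorted(set().union(*docs))
    rankings = {}
    for cls in sorted(set(labels)):
        scored = []
        for word in vocab:
            tp = sum(1 for d, y in zip(docs, labels) if word in d and y == cls)
            fp = sum(1 for d, y in zip(docs, labels) if word in d and y != cls)
            fn = sum(1 for d, y in zip(docs, labels) if word not in d and y == cls)
            tn = len(docs) - tp - fp - fn
            scored.append((info_gain(tp, fp, fn, tn), word))
        # Highest score first; ties broken alphabetically (an assumption).
        scored.sort(key=lambda item: (-item[0], item[1]))
        rankings[cls] = [word for _, word in scored]
    return rankings


def round_robin_select(rankings, k):
    """Let each class in turn pick its best word not already selected."""
    selected, chosen = [], set()
    position = {cls: 0 for cls in rankings}
    while len(selected) < k:
        progressed = False
        for cls, ranked in rankings.items():
            i = position[cls]
            while i < len(ranked) and ranked[i] in chosen:
                i += 1  # skip words another class already contributed
            if i < len(ranked):
                chosen.add(ranked[i])
                selected.append(ranked[i])
                i += 1
                progressed = True
            position[cls] = i
            if len(selected) == k:
                break
        if not progressed:
            break  # every ranking is exhausted
    return selected
```

Under these assumptions, a class whose words all score poorly on a global ranking still receives the same number of turns as an easy class, which is the imbalance the abstract describes. A single global top-k list would instead fill up with the strongly predictive words of the easy classes.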



REVIEW

Reviewer: Fabrizio Sebastiani

Text classification (also known as categorization) is the task of filing, given a predefined set of classes, textual documents, under the class (or classes) in which they belong, based on an analysis of the documents' contents. Classifiers are aut...