High-performing feature selection for text classification
Source: Conference on Information and Knowledge Management
Proceedings of the eleventh international conference on Information and knowledge management
McLean, Virginia, USA
SESSION: Poster session
Pages: 659 - 661
Year of Publication: 2002
ISBN:1-58113-492-4
Authors
Monica Rogati  CSD, Carnegie Mellon University, Pittsburgh, PA
Yiming Yang  CSD, Carnegie Mellon University, Pittsburgh, PA
Sponsors
SIGMIS: ACM Special Interest Group on Management Information Systems
ACM: Association for Computing Machinery
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/584792.584911

ABSTRACT

This paper reports a controlled study of a large number of filter feature selection methods for text classification. Over 100 variants of five major feature selection criteria were examined using four well-known classification algorithms: a Naive Bayesian (NB) approach, a Rocchio-style classifier, a k-nearest neighbor (kNN) method, and a Support Vector Machine (SVM) system. Two benchmark collections were chosen as the testbeds: Reuters-21578 and a small portion of Reuters Corpus Volume 1 (RCV1), making the new results comparable to published results. We found that feature selection methods based on χ² statistics consistently outperformed those based on other criteria (including information gain) for all four classifiers and both data collections, and that a further increase in performance was obtained by combining uncorrelated and high-performing feature selection methods. The results we obtained using only 3% of the available features are among the best reported, including results obtained with the full feature set.
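
As an illustration of the kind of filter criterion the abstract refers to, the following is a minimal sketch of χ²-based term scoring followed by selection of a small fraction of the vocabulary. It is not the authors' implementation: the function names, the single-label toy data, and the choice of taking the maximum score over categories are all assumptions made for this example, and the abstract does not say which of the 100+ variants performed best.

```python
from collections import Counter
import math


def chi2_score(a, b, c, d):
    """Chi-square statistic for a 2x2 term/category contingency table.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: term or category covers all/no documents
    return n * (a * d - c * b) ** 2 / denom


def chi2_max_scores(docs, labels):
    """Score each term by its maximum chi-square over all categories.

    docs: list of iterables of terms (one per document)
    labels: list of category labels, one per document (assumed single-label)
    """
    n = len(docs)
    cat_counts = Counter(labels)
    term_counts = Counter()
    joint = Counter()
    for terms, label in zip(docs, labels):
        for t in set(terms):  # document frequency, not term frequency
            term_counts[t] += 1
            joint[(t, label)] += 1

    scores = {}
    for t, df in term_counts.items():
        best = 0.0
        for cat, n_cat in cat_counts.items():
            a = joint[(t, cat)]
            b = df - a
            c = n_cat - a
            d = n - a - b - c
            best = max(best, chi2_score(a, b, c, d))
        scores[t] = best
    return scores


def select_top_fraction(scores, fraction=0.03):
    """Keep the highest-scoring fraction of terms (at least one term)."""
    k = max(1, math.ceil(fraction * len(scores)))
    # Ties are broken alphabetically so the result is deterministic.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


# Hypothetical toy corpus, for illustration only.
docs = [
    {"wheat", "price"},
    {"wheat", "farm"},
    {"oil", "price"},
    {"oil", "barrel"},
]
labels = ["grain", "grain", "crude", "crude"]
scores = chi2_max_scores(docs, labels)
selected = select_top_fraction(scores, fraction=0.5)
```

In this toy corpus, "wheat" and "oil" each occur in exactly one category and receive the highest score, while "price" is evenly split across both categories and scores zero. The default `fraction=0.03` mirrors the 3% figure quoted in the abstract; averaging over categories instead of taking the maximum would be an equally plausible variant.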


