High-performing feature selection for text classification
Source
Conference on Information and Knowledge Management
Proceedings of the eleventh international conference on Information and knowledge management
McLean, Virginia, USA
SESSION: Poster session
Pages: 659 - 661
Year of Publication: 2002
ISBN: 1-58113-492-4
Authors
Monica Rogati, CSD, Carnegie Mellon University, Pittsburgh, PA
Yiming Yang, CSD, Carnegie Mellon University, Pittsburgh, PA
ABSTRACT
This paper reports a controlled study of a large number of filter feature selection methods for text classification. Over 100 variants of five major feature selection criteria were examined using four well-known classification algorithms: a Naive Bayesian (NB) approach, a Rocchio-style classifier, a k-nearest neighbor (kNN) method, and a Support Vector Machine (SVM) system. Two benchmark collections were chosen as the testbeds: Reuters-21578 and a small portion of Reuters Corpus Version 1 (RCV1), making the new results comparable to published results. We found that feature selection methods based on χ² (chi-square) statistics consistently outperformed those based on other criteria (including information gain) for all four classifiers and both data collections, and that a further increase in performance was obtained by combining uncorrelated and high-performing feature selection methods. The results we obtained using only 3% of the available features are among the best reported, including results obtained with the full feature set.
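As a rough illustration of what a χ²-based filter looks like, the sketch below scores each term against one category using the standard 2x2 contingency-table form of the chi-square statistic and keeps the top fraction of terms. This is an assumption-laden sketch, not the authors' implementation: the abstract does not say which of the 100+ variants performed best, and the function names (`chi2_score`, `select_features`), the toy documents, and the tie-breaking rule are all hypothetical.

```python
from collections import Counter


def chi2_score(a, b, c, d):
    """Chi-square statistic for a 2x2 term/category contingency table.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table (e.g. empty category): treat the term as uninformative.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def select_features(docs, labels, category, fraction=0.03):
    """Rank terms by chi-square against one category and keep the top fraction.

    docs: list of token lists; labels: one category label per document.
    Returns (selected_terms, scores). Ties are broken alphabetically
    (an arbitrary choice made here for determinism).
    """
    n_pos = sum(1 for y in labels if y == category)
    n_neg = len(labels) - n_pos
    df_pos, df_neg = Counter(), Counter()  # document frequencies per side
    for tokens, y in zip(docs, labels):
        target = df_pos if y == category else df_neg
        target.update(set(tokens))  # count each term once per document
    vocab = set(df_pos) | set(df_neg)
    scores = {
        t: chi2_score(df_pos[t], df_neg[t], n_pos - df_pos[t], n_neg - df_neg[t])
        for t in vocab
    }
    k = max(1, round(fraction * len(vocab)))
    ranked = sorted(vocab, key=lambda t: (-scores[t], t))
    return ranked[:k], scores


# Toy data (hypothetical, not drawn from Reuters-21578 or RCV1).
docs = [["wheat", "price"], ["wheat", "export"], ["oil", "price"], ["oil", "export"]]
labels = ["grain", "grain", "crude", "crude"]
selected, scores = select_features(docs, labels, "grain", fraction=0.5)
print(selected)
```

On this toy data, "wheat" and "oil" perfectly separate the two categories and receive the highest scores, while "price" and "export" occur equally on both sides and score zero. For a multi-category collection, per-category scores would still need to be combined (for example by maximum or average); which combination the paper favors is not stated in the abstract.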