ACM Home Page
Please provide us with feedback. Feedback
Feature selection methods for text classification
Full text MovMov (19:55),  PdfPdf (1.02 MB)
Source
International Conference on Knowledge Discovery and Data Mining archive
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining table of contents
San Jose, California, USA
SESSION: Research track papers table of contents
Pages: 230 - 239  
Year of Publication: 2007
ISBN:978-1-59593-609-7
Authors
Anirban Dasgupta  Yahoo Research
Petros Drineas  RPI
Boulos Harb  U Penn
Vanja Josifovski  Yahoo Research
Michael W. Mahoney  Yahoo Research
Sponsors
ACM: Association for Computing Machinery
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 53,   Downloads (12 Months): 338,   Citation Count: 2
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1281192.1281220
What is a DOI?

ABSTRACT

We consider feature selection for text classification both theoretically and empirically. Our main result is an unsupervised feature selection strategy for which we give worst-case theoretical guarantees on the generalization power of the resultant classification function f with respect to the classification function f obtained when keeping all the features. To the best of our knowledge, this is the first feature selection method with such guarantees. In addition, the analysis leads to insights as to when and why this feature selection strategy will perform well in practice. We then use the TechTC-100, 20-Newsgroups, and Reuters-RCV2 data sets to evaluate empirically the performance of this and two simpler but related feature selection strategies against two commonly-used strategies. Our empirical evaluation shows that the strategy with provable performance guarantees performs well in comparison with other commonly-used feature selection strategies. In addition, it performs better on certain datasets under very aggressive feature selection.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
20 Newsgroups Dataset. J. Rennie. http://people.csail.mit.edu/jrennie/20Newsgroups/.
 
2
20 Newsgroups Dataset. UCI KDD Archive. http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html.
3
 
4
 
5
 
6
 
7
A. Das and D. Kempe. Algorithms for subset selection in linear regression. Manuscript.
 
8
9
 
10
11
 
12
T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines. Advances in Computational Mathematics, 13(1):1--50, 1999.
 
13
14
15
16
 
17
 
18
 
19
 
20
 
21
K. Lang. Newsweeder: Learning to filter netnews. In Proceedings of the 12th International Conference on Machine Learning, pages 331--339, 1995.
 
22
 
23
 
24
 
25
T. Poggio and S. Smale. The mathematics of learning: Dealing with data. Notices of the AMS, 50(5):537--544, May 2003.
 
26
 
27
R. Rifkin, G. Yeo, and T. Poggio. Regularized least-squares classification. In J. A. K. Suykens, G. Horvath, S. Basu, C. Micchelli, and J. Vandewalle, editors, Advances in Learning Theory: Methods, Models and Applications, NATO Science Series III: Computer and Systems Sciences, pages 131--154. VIOS Press, 2003.
 
28
M. Rudelson and R. Vershynin. Sampling from large matrices: an approach through geometric functional analysis. Manuscript. Salton. Proceedings of the 14th Annual International ACM SIGIR Conference, pages 356--358, 1991.
 
29
30
 
31
G. W. Stewart and J. G. Sun. Matrix Perturbation Theory. Academic Press, New York, 1990.
 
32
 
33
A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill-Posed Problems. W. H. Winston, Washington, D. C., 1977.
 
34
V. N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
35
 
36
Y. Yang. Sampling strategies and learning efficiency in text categorization. In AAAI Spring Symposium on Machine Learning in Information Access, pages 88--95, 1996.
37
 
38
39
 
40
 
41


Collaborative Colleagues:
Anirban Dasgupta: colleagues
Petros Drineas: colleagues
Boulos Harb: colleagues
Vanja Josifovski: colleagues
Michael W. Mahoney: colleagues