Text categorization with many redundant features: using aggressive feature selection to make SVMs competitive with C4.5
Source ACM International Conference Proceeding Series; Vol. 69 archive
Proceedings of the twenty-first international conference on Machine learning table of contents
Banff, Alberta, Canada
Page: 41  
Year of Publication: 2004
ISBN:1-58113-828-5
Authors
Evgeniy Gabrilovich  Israel Institute of Technology, Haifa, Israel
Shaul Markovitch  Israel Institute of Technology, Haifa, Israel
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1015330.1015388

ABSTRACT

Text categorization algorithms usually represent documents as bags of words and consequently have to deal with huge numbers of features. Most previous studies found that the majority of these features are relevant for classification, and that the performance of text categorization with support vector machines peaks when no feature selection is performed. We describe a class of text categorization problems that are characterized by many redundant features. Even though most of these features are relevant, the underlying concepts can be concisely captured using only a few features, while keeping all of them has a substantially detrimental effect on categorization accuracy. We develop a novel measure that captures feature redundancy, and use it to analyze a large collection of datasets. We show that for problems plagued by numerous redundant features the performance of C4.5 is significantly superior to that of SVM, while aggressive feature selection allows SVM to beat C4.5 by a narrow margin.
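
To make the idea of aggressive feature selection over a bag-of-words representation concrete, here is a minimal sketch. It assumes information gain as the ranking criterion and a fixed fraction of features to keep; the abstract names neither, so both are illustrative choices, and the toy corpus and the function names (entropy, information_gain, select_features) are hypothetical. It does not implement the paper's redundancy measure.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(docs, labels, word):
    """Reduction in label entropy from knowing whether `word` occurs in a document."""
    present = [y for d, y in zip(docs, labels) if word in d]
    absent = [y for d, y in zip(docs, labels) if word not in d]
    n = len(labels)
    conditional = (len(present) / n) * entropy(present) \
        + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional


def select_features(docs, labels, keep_fraction):
    """Keep the top `keep_fraction` of the vocabulary, ranked by information gain."""
    vocabulary = sorted(set().union(*docs))
    ranked = sorted(
        vocabulary,
        key=lambda w: information_gain(docs, labels, w),
        reverse=True,
    )
    k = max(1, int(len(ranked) * keep_fraction))
    return ranked[:k]


# Hypothetical toy corpus: each document is a set of words (a binary bag of words).
docs = [
    {"goal", "match", "team", "the"},
    {"team", "score", "goal", "a"},
    {"stock", "market", "shares", "the"},
    {"market", "trade", "stock", "a"},
]
labels = ["sport", "sport", "finance", "finance"]

# Assumed setting: keep only 40% of the 10-word vocabulary.
selected = select_features(docs, labels, keep_fraction=0.4)
print(selected)
```

In this toy corpus, "goal" and "team" each separate the two classes perfectly, as do "stock" and "market", so they are relevant but mutually redundant: any one of them already captures the concept. A classifier would then be trained on the selected words only.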


