ABSTRACT
Text categorization algorithms usually represent documents as bags of words and consequently have to deal with huge numbers of features. Most previous studies found that the majority of these features are relevant for classification, and that the performance of text categorization with support vector machines peaks when no feature selection is performed. We describe a class of text categorization problems that are characterized by many redundant features. Even though most of these features are relevant, the underlying concepts can be captured concisely using only a few features, while keeping all of them has a substantially detrimental effect on categorization accuracy. We develop a novel measure that captures feature redundancy, and use it to analyze a large collection of datasets. We show that for problems plagued with numerous redundant features, the performance of C4.5 is significantly superior to that of SVM, while aggressive feature selection allows SVM to beat C4.5 by a narrow margin.
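To make the idea of aggressive feature selection concrete, here is a minimal sketch in Python. The abstract does not say which selection criterion is used, so the sketch assumes information gain, a common choice in text categorization. The function names (`entropy`, `information_gain`, `select_features`) and the toy documents are hypothetical and serve only as illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(docs, labels, word):
    """Reduction in label entropy from knowing whether `word` occurs.

    Assumption: each document is a set of words (binary bag of words).
    """
    with_word = [y for d, y in zip(docs, labels) if word in d]
    without_word = [y for d, y in zip(docs, labels) if word not in d]
    n = len(labels)
    conditional = sum(
        len(part) / n * entropy(part)
        for part in (with_word, without_word)
        if part  # skip an empty side of the split
    )
    return entropy(labels) - conditional

def select_features(docs, labels, k):
    """Keep only the k words with the highest information gain."""
    vocabulary = sorted(set().union(*docs))
    ranked = sorted(
        vocabulary,
        key=lambda w: information_gain(docs, labels, w),
        reverse=True,
    )
    return ranked[:k]

# Toy data (hypothetical): four documents, two categories.
docs = [
    {"wheat", "grain", "price"},
    {"wheat", "harvest", "price"},
    {"oil", "barrel", "price"},
    {"oil", "crude", "market"},
]
labels = ["grain", "grain", "energy", "energy"]

# Aggressive selection: keep 2 of the 8 vocabulary words.
selected = select_features(docs, labels, 2)
print(selected)
```

In this toy corpus, words such as "grain" and "harvest" carry some information about the category but add little once "wheat" is known, which is the kind of redundancy the abstract refers to. A classifier would then be trained on documents restricted to the selected words.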