ABSTRACT
Automated text categorization is an important technique for many web applications, such as document indexing, document filtering, and cataloging web resources. Many approaches have been proposed for the automated text categorization problem. Among them, centroid-based approaches have the advantage of short training and testing times, owing to their computational efficiency. As a result, centroid-based classifiers have been widely used in many web applications. However, the accuracy of centroid-based classifiers is inferior to that of SVM, mainly because the centroids found during construction are far from their ideal locations. We design a fast Class-Feature-Centroid (CFC) classifier for multi-class, single-label text categorization. In CFC, a centroid is built from two important class distributions: the inter-class term index and the inner-class term index. CFC combines these indices in a novel way and employs a denormalized cosine measure to calculate the similarity score between a text vector and a centroid. Experiments on the Reuters-21578 corpus and the 20-newsgroup email collection show that CFC consistently outperforms state-of-the-art SVM classifiers on both micro-F1 and macro-F1 scores. In particular, CFC is more effective and more robust than SVM when data is sparse.
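
The abstract does not give the weighting formula, so the following Python sketch is only one plausible reading of the idea, not the authors' implementation. It assumes that the inner-class index is the fraction of a class's documents containing a term, that the inter-class index is the log of the number of classes divided by the number of classes containing the term, and that the two are combined as `b ** inner * inter`. The function names, the base `b = 2.0`, and the toy data are all hypothetical. "Denormalized cosine" is taken here to mean that the document vector is length-normalized while the centroid is left unnormalized.

```python
import math
from collections import Counter, defaultdict


def build_centroids(docs, labels, b=2.0):
    """Build one weighted centroid per class (illustrative sketch only).

    docs   -- list of token lists
    labels -- class label for each document
    b      -- base of the inner-class exponent (arbitrary assumed value, b > 1)
    """
    class_sizes = Counter(labels)
    # doc_freq[c][t]: number of documents of class c that contain term t
    doc_freq = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        doc_freq[label].update(set(tokens))
    # class_freq[t]: number of classes in which term t occurs at least once
    class_freq = Counter(t for c in doc_freq for t in doc_freq[c])
    n_classes = len(class_sizes)

    centroids = {}
    for c, term_df in doc_freq.items():
        centroids[c] = {
            # inner-class part: b ** (share of class documents with the term)
            # inter-class part: log(#classes / #classes containing the term)
            t: (b ** (n / class_sizes[c])) * math.log(n_classes / class_freq[t])
            for t, n in term_df.items()
        }
    return centroids


def classify(tokens, centroids):
    """Return the class with the highest denormalized cosine score."""
    tf = Counter(tokens)
    doc_norm = math.sqrt(sum(v * v for v in tf.values()))
    if doc_norm == 0:
        return None  # empty document: nothing to score

    def score(c):
        # Only the document vector is normalized; the centroid is not.
        dot = sum(v * centroids[c].get(t, 0.0) for t, v in tf.items())
        return dot / doc_norm

    return max(centroids, key=score)


if __name__ == "__main__":
    docs = [
        ["the", "goal", "match", "team"],
        ["the", "team", "score", "goal"],
        ["the", "stock", "market", "price"],
        ["the", "market", "trade", "stock"],
    ]
    labels = ["sport", "sport", "finance", "finance"]
    centroids = build_centroids(docs, labels)
    print(classify(["goal", "team"], centroids))
    print(classify(["stock", "price"], centroids))
```

Under these assumptions a term that occurs in every class receives weight zero, so it cannot influence the decision, while a term concentrated in a single class dominates that class's centroid.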