A class-feature-centroid classifier for text categorization
Source
International World Wide Web Conference
Proceedings of the 18th International Conference on World Wide Web
Madrid, Spain
SESSION: Data mining/session: learning
Pages 201-210
Year of Publication: 2009
ISBN: 978-1-60558-487-4
Authors
Hu Guan, Shanghai Jiao Tong University, Shanghai, China
Jingyu Zhou, Shanghai Jiao Tong University, Shanghai, China
Minyi Guo, Shanghai Jiao Tong University, Shanghai, China
Sponsor
ACM: Association for Computing Machinery
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1526709.1526737

ABSTRACT

Automated text categorization is an important technique for many web applications, such as document indexing, document filtering, and cataloging web resources. Many different approaches have been proposed for the automated text categorization problem. Among them, centroid-based approaches have the advantages of short training and testing times due to their computational efficiency. As a result, centroid-based classifiers have been widely used in many web applications. However, the accuracy of centroid-based classifiers is inferior to that of SVM, mainly because the centroids found during construction are far from their ideal locations.

We design a fast Class-Feature-Centroid (CFC) classifier for multi-class, single-label text categorization. In CFC, a centroid is built from two important class distributions: the inter-class term index and the inner-class term index. CFC proposes a novel combination of these indices and employs a denormalized cosine measure to calculate the similarity score between a text vector and a centroid. Experiments on the Reuters-21578 corpus and the 20-newsgroup email collection show that CFC consistently outperforms state-of-the-art SVM classifiers on both micro-F1 and macro-F1 scores. In particular, CFC is more effective and robust than SVM when data is sparse.
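The abstract does not state the weighting formula, so the following is only a minimal sketch of how such a classifier might be put together, under stated assumptions. It assumes the inner-class index is the fraction of a class's documents that contain a term, the inter-class index is log(number of classes / number of classes containing the term), and the two are combined as b^(inner) * inter with b > 1. It also assumes that "denormalized cosine" means the document vector is length-normalized while the centroid is left unnormalized. The names train_cfc, classify, and b are hypothetical and chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def train_cfc(docs, labels, b=math.e):
    """Build one centroid per class from tokenized documents.

    Assumed weight for term t in class c (not taken from the abstract):
        b ** (DF[c][t] / |c|) * log(n_classes / CF[t])
    """
    class_sizes = Counter(labels)          # documents per class
    doc_freq = defaultdict(Counter)        # doc_freq[c][t]: docs of class c containing t
    for tokens, label in zip(docs, labels):
        doc_freq[label].update(set(tokens))

    class_freq = Counter()                 # class_freq[t]: classes containing t
    for counts in doc_freq.values():
        class_freq.update(counts.keys())

    n_classes = len(class_sizes)
    centroids = {}
    for c, counts in doc_freq.items():
        centroids[c] = {
            t: b ** (df / class_sizes[c]) * math.log(n_classes / class_freq[t])
            for t, df in counts.items()
        }
    return centroids


def classify(tokens, centroids):
    """Return the class whose centroid scores highest for the document.

    Only the document vector is normalized; the centroid keeps its raw
    magnitude (the assumed reading of "denormalized cosine").
    """
    tf = Counter(tokens)
    norm = math.sqrt(sum(v * v for v in tf.values())) or 1.0
    scores = {
        c: sum((v / norm) * weights.get(t, 0.0) for t, v in tf.items())
        for c, weights in centroids.items()
    }
    return max(scores, key=scores.get)


# Toy usage with made-up documents.
docs = [
    ["oil", "price", "rise"],
    ["oil", "barrel", "export"],
    ["match", "goal", "win"],
    ["team", "goal", "score"],
]
labels = ["energy", "energy", "sport", "sport"]
centroids = train_cfc(docs, labels)
print(classify(["oil", "export"], centroids))
```

Under these assumptions, a term that occurs in every class receives weight zero (log 1 = 0), so only class-discriminating terms contribute to the score. Training is a single pass over the corpus, which is consistent with the short training time the abstract attributes to centroid-based methods.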


