ACM Home Page
Please provide us with feedback. Feedback
Enhanced word clustering for hierarchical text classification
Full text PdfPdf (993 KB)
Source International Conference on Knowledge Discovery and Data Mining archive
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining table of contents
Edmonton, Alberta, Canada
SESSION: Text classification table of contents
Pages: 191 - 200  
Year of Publication: 2002
ISBN:1-58113-567-X
Authors
Inderjit S. Dhillon  Univ. of Texas, Austin
Subramanyam Mallela  Univ. of Texas, Austin
Rahul Kumar  Univ. of Texas, Austin
Sponsors
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
: AAAI
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 13,   Downloads (12 Months): 122,   Citation Count: 18
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/775047.775076
What is a DOI?

ABSTRACT

In this paper we propose a new information-theoretic divisive algorithm for word clustering applied to text classification. In previous work, such "distributional clustering" of features has been found to achieve improvements over feature selection in terms of classification accuracy, especially at lower number of features [2, 28]. However the existing clustering techniques are agglomerative in nature and result in (i) sub-optimal word clusters and (ii) high computational cost. In order to explicitly capture the optimality of word clusters in an information theoretic framework, we first derive a global criterion for feature clustering. We then present a fast, divisive algorithm that monotonically decreases this objective function value, thus converging to a local minimum. We show that our algorithm minimizes the "within-cluster Jensen-Shannon divergence" while simultaneously maximizing the "between-cluster Jensen-Shannon divergence". In comparison to the previously proposed agglomerative strategies our divisive algorithm achieves higher classification accuracy especially at lower number of features. We further show that feature clustering is an effective technique for building smaller class models in hierarchical classification. We present detailed experimental results using Naive Bayes and Support Vector Machines on the 20 Newsgroups data set and a 3-level hierarchy of HTML documents collected from Dmoz Open Directory.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
ANSI/IEEE, New York. IEEE Standard for Binary Floating Point Arithmetic, Std 754--1985 edition, 1985.
2
3
 
4
P. Berkhin and J. D. Becher. Learning simple relations: Theory and applications. In Second SIAM Data Mining Conference, pages 420--436, 2002.
 
5
 
6
 
7
S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6):391--407, 1990.
 
8
 
9
 
10
 
11
E. Forgy. Cluster analysis of multivariate data: Efficiency vs. interpretability of classifications. Biometrics, 21(3):768, 1965.
 
12
 
13
M. R. Garey, D. S. Johnson, and H. S. Witsenhausen. The complexity of the generalized Lloyd-Max problem. IEEE Trans. Inform. Theory, 28(2):255--256, 1982.
14
 
15
R. M. Gray and D. L. Neuhoff. Quantization. IEEE Trans. Inform. Theory, 44(6):1--63, 1998.
16
 
17
 
18
 
19
S. Kullback and R. A. Leibler. On information and sufficiency. Ann. Math. Stat., 22:79--86, 1951.
 
20
J. Lin. Divergence measures based on the Shannon entropy. IEEE Trans. Inform. Theory, 37(1), 1991.
 
21
A. McCallum and K. Nigam. A comparison of event models for naive bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization, 1998.
 
22
A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering, www.cs.cmu.edu/mccallum/bow, 1996.
 
23
T. Mitchell. Conditions for the equivalence of hierarchical and non-hierarchical bayesian classifiers. Technical report, CALD, CMU, 1998.
 
24
 
25
 
26
 
27
C. E. Shannon. A mathematical theory of communication. Bell System Technical J., 27, 1948.
 
28
N. Slonim and N. Tishby. The power of word clusters for text classification. In 23rd European Colloquium on Information Retrieval Research (ECIR), 2001.
 
29
 
30

CITED BY  18

Collaborative Colleagues:
Inderjit S. Dhillon: colleagues
Subramanyam Mallela: colleagues
Rahul Kumar: colleagues