|
ABSTRACT
In this paper we propose a new information-theoretic divisive algorithm for word clustering applied to text classification. In previous work, such "distributional clustering" of features has been found to achieve improvements over feature selection in terms of classification accuracy, especially at lower number of features [2, 28]. However the existing clustering techniques are agglomerative in nature and result in (i) sub-optimal word clusters and (ii) high computational cost. In order to explicitly capture the optimality of word clusters in an information theoretic framework, we first derive a global criterion for feature clustering. We then present a fast, divisive algorithm that monotonically decreases this objective function value, thus converging to a local minimum. We show that our algorithm minimizes the "within-cluster Jensen-Shannon divergence" while simultaneously maximizing the "between-cluster Jensen-Shannon divergence". In comparison to the previously proposed agglomerative strategies our divisive algorithm achieves higher classification accuracy especially at lower number of features. We further show that feature clustering is an effective technique for building smaller class models in hierarchical classification. We present detailed experimental results using Naive Bayes and Support Vector Machines on the 20 Newsgroups data set and a 3-level hierarchy of HTML documents collected from Dmoz Open Directory.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
ANSI/IEEE, New York. IEEE Standard for Binary Floating Point Arithmetic, Std 754--1985 edition, 1985.
|
 |
2
|
|
 |
3
|
Ron Bekkerman , Ran El-Yaniv , Naftali Tishby , Yoad Winter, On feature distributional clustering for text categorization, Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, p.146-153, September 2001, New Orleans, Louisiana, United States
[doi> 10.1145/383952.383976]
|
| |
4
|
P. Berkhin and J. D. Becher. Learning simple relations: Theory and applications. In Second SIAM Data Mining Conference, pages 420--436, 2002.
|
| |
5
|
Soumen Chakrabarti , Byron Dom , Rakesh Agrawal , Prabhakar Raghavan, Using Taxonomy, Discriminants, and Signatures for Navigating in Text Databases, Proceedings of the 23rd International Conference on Very Large Data Bases, p.446-455, August 25-29, 1997
|
| |
6
|
|
| |
7
|
S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6):391--407, 1990.
|
| |
8
|
|
| |
9
|
|
| |
10
|
|
| |
11
|
E. Forgy. Cluster analysis of multivariate data: Efficiency vs. interpretability of classifications. Biometrics, 21(3):768, 1965.
|
| |
12
|
|
| |
13
|
M. R. Garey, D. S. Johnson, and H. S. Witsenhausen. The complexity of the generalized Lloyd-Max problem. IEEE Trans. Inform. Theory, 28(2):255--256, 1982.
|
 |
14
|
|
| |
15
|
R. M. Gray and D. L. Neuhoff. Quantization. IEEE Trans. Inform. Theory, 44(6):1--63, 1998.
|
 |
16
|
|
| |
17
|
|
| |
18
|
|
| |
19
|
S. Kullback and R. A. Leibler. On information and sufficiency. Ann. Math. Stat., 22:79--86, 1951.
|
| |
20
|
J. Lin. Divergence measures based on the Shannon entropy. IEEE Trans. Inform. Theory, 37(1), 1991.
|
| |
21
|
A. McCallum and K. Nigam. A comparison of event models for naive bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization, 1998.
|
| |
22
|
A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering, www.cs.cmu.edu/mccallum/bow, 1996.
|
| |
23
|
T. Mitchell. Conditions for the equivalence of hierarchical and non-hierarchical bayesian classifiers. Technical report, CALD, CMU, 1998.
|
| |
24
|
|
| |
25
|
|
| |
26
|
|
| |
27
|
C. E. Shannon. A mathematical theory of communication. Bell System Technical J., 27, 1948.
|
| |
28
|
N. Slonim and N. Tishby. The power of word clusters for text classification. In 23rd European Colloquium on Information Retrieval Research (ECIR), 2001.
|
| |
29
|
|
| |
30
|
|
CITED BY 18
|
|
Wenyuan Dai , Gui-Rong Xue , Qiang Yang , Yong Yu, Co-clustering based classification for out-of-domain documents, Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, August 12-15, 2007, San Jose, California, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Zheng-Yu Niu , Dong-Hong Ji , Chew Lim Tan, A semi-supervised feature clustering algorithm with application to word sense disambiguation, Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, p.907-914, October 06-08, 2005, Vancouver, British Columbia, Canada
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Xiao Ling , Gui-Rong Xue , Wenyuan Dai , Yun Jiang , Qiang Yang , Yong Yu, Can chinese web pages be classified with english data source?, Proceeding of the 17th international conference on World Wide Web, April 21-25, 2008, Beijing, China
|
|
|
|
|
|
Brett Poulin , Roman Eisner , Duane Szafron , Paul Lu , Russ Greiner , D. S. Wishart , Alona Fyshe , Brandon Pearcy , Cam MacDonell , John Anvik, Visual explanation of evidence in additive classifiers, Proceedings of the 18th conference on Innovative applications of artificial intelligence, p.1822-1829, July 16-20, 2006, Boston, Massachusetts
|
|