ACM Home Page
Please provide us with feedback. Feedback
Information-theoretic co-clustering
Full text PdfPdf (1.96 MB)
Source International Conference on Knowledge Discovery and Data Mining archive
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining table of contents
Washington, D.C.
SESSION: Research track table of contents
Pages: 89 - 98  
Year of Publication: 2003
ISBN:1-58113-737-0
Authors
Inderjit S. Dhillon  University of Texas, Austin
Subramanyam Mallela  University of Texas, Austin
Dharmendra S. Modha  IBM Almaden Research Center, San Jose, CA
Sponsors
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 52,   Downloads (12 Months): 300,   Citation Count: 83
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/956750.956764
What is a DOI?

ABSTRACT

Two-dimensional contingency or co-occurrence tables arise frequently in important applications such as text, web-log and market-basket data analysis. A basic problem in contingency table analysis is co-clustering: simultaneous clustering of the rows and columns. A novel theoretical formulation views the contingency table as an empirical joint probability distribution of two discrete random variables and poses the co-clustering problem as an optimization problem in information theory---the optimal co-clustering maximizes the mutual information between the clustered random variables subject to constraints on the number of row and column clusters. We present an innovative co-clustering algorithm that monotonically increases the preserved mutual information by intertwining both the row and column clusterings at all stages. Using the practical example of simultaneous word-document clustering, we demonstrate that our algorithm works well in practice, especially in the presence of sparsity and high-dimensionality.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
P. Bradley, U. Fayyad, and C. Reina. Scaling clustering algorithms to large databases. In KDD'03. AAAI Press, 1998.
 
2
 
3
4
 
5
I. S. Dhillon, J. Fan, and Y. Guan. Efficient clustering of very large document collections. In R. Grossman, C. Kamath, P. Kegelmeyer, V. Kumar, and R. Namburu, editors, Data Mining for Scientific and Engineering Applications, pages 357-381. Kluwer Academic Publishers, 2001.
 
6
 
7
 
8
 
9
R. A. Fisher. On the interpretation of x2 from the contingency tables, and the calculation of p. J. Royal Stat. Soc., 85:87-94, 1922.
 
10
 
11
J. A. Hartigan. Direct clustering of a data matrix. Journal of the American Statistical Association, 67(337):123-129, March 1972.
12
 
13
 
14
 
15
Ken Lang. News Weeder: Learning to filter netnews. In ICML-95, pages 331-339, 1995.
 
16
A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. www.cs.cmu.edu/ mccallum/bow, 1996.
 
17
B. Mirkin. Mathematical Classification and Clustering. Kluwer Academic Publishers, 1996.
 
18
N. Slonim, N. Friedman, and N. Tishby. Agglomerative multivariate information bottleneck. In NIPS-14, 2001.
19
 
20
N. Tishby, F. C. Pereira, and W. Bialek. The information bottleneck method. In Proc. of the 37-th Annual Allerton Conference on Communication, Control and Computing, pages 368-377, 1999.

CITED BY  84

Collaborative Colleagues:
Inderjit S. Dhillon: colleagues
Subramanyam Mallela: colleagues
Dharmendra S. Modha: colleagues