|
ABSTRACT
Two-dimensional contingency or co-occurrence tables arise frequently in important applications such as text, web-log and market-basket data analysis. A basic problem in contingency table analysis is co-clustering: simultaneous clustering of the rows and columns. A novel theoretical formulation views the contingency table as an empirical joint probability distribution of two discrete random variables and poses the co-clustering problem as an optimization problem in information theory---the optimal co-clustering maximizes the mutual information between the clustered random variables subject to constraints on the number of row and column clusters. We present an innovative co-clustering algorithm that monotonically increases the preserved mutual information by intertwining both the row and column clusterings at all stages. Using the practical example of simultaneous word-document clustering, we demonstrate that our algorithm works well in practice, especially in the presence of sparsity and high-dimensionality.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
P. Bradley, U. Fayyad, and C. Reina. Scaling clustering algorithms to large databases. In KDD'03. AAAI Press, 1998.
|
| |
2
|
|
| |
3
|
|
 |
4
|
|
| |
5
|
I. S. Dhillon, J. Fan, and Y. Guan. Efficient clustering of very large document collections. In R. Grossman, C. Kamath, P. Kegelmeyer, V. Kumar, and R. Namburu, editors, Data Mining for Scientific and Engineering Applications, pages 357-381. Kluwer Academic Publishers, 2001.
|
| |
6
|
|
| |
7
|
|
| |
8
|
|
| |
9
|
R. A. Fisher. On the interpretation of x2 from the contingency tables, and the calculation of p. J. Royal Stat. Soc., 85:87-94, 1922.
|
| |
10
|
|
| |
11
|
J. A. Hartigan. Direct clustering of a data matrix. Journal of the American Statistical Association, 67(337):123-129, March 1972.
|
 |
12
|
|
| |
13
|
|
| |
14
|
|
| |
15
|
Ken Lang. News Weeder: Learning to filter netnews. In ICML-95, pages 331-339, 1995.
|
| |
16
|
A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. www.cs.cmu.edu/ mccallum/bow, 1996.
|
| |
17
|
B. Mirkin. Mathematical Classification and Clustering. Kluwer Academic Publishers, 1996.
|
| |
18
|
N. Slonim, N. Friedman, and N. Tishby. Agglomerative multivariate information bottleneck. In NIPS-14, 2001.
|
 |
19
|
|
| |
20
|
N. Tishby, F. C. Pereira, and W. Bialek. The information bottleneck method. In Proc. of the 37-th Annual Allerton Conference on Communication, Control and Computing, pages 368-377, 1999.
|
CITED BY 86
|
|
Wenyuan Dai , Gui-Rong Xue , Qiang Yang , Yong Yu, Co-clustering based classification for out-of-domain documents, Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, August 12-15, 2007, San Jose, California, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Arindam Banerjee , Inderjit Dhillon , Joydeep Ghosh , Srujana Merugu , Dharmendra S. Modha, A generalized maximum entropy approach to bregman co-clustering and matrix approximation, Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, August 22-25, 2004, Seattle, WA, USA
|
|
|
Deepayan Chakrabarti , Spiros Papadimitriou , Dharmendra S. Modha , Christos Faloutsos, Fully automatic cross-associations, Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, August 22-25, 2004, Seattle, WA, USA
|
|
|
|
|
|
|
|
|
|
|
|
Bin Gao , Tie-Yan Liu , Xin Zheng , Qian-Sheng Cheng , Wei-Ying Ma, Consistent bipartite graph co-partitioning for star-structured high-order heterogeneous data co-clustering, Proceeding of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, August 21-24, 2005, Chicago, Illinois, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Bin Gao , Tie-Yan Liu , Guang Feng , Tao Qin , Qian-Sheng Cheng , Wei-Ying Ma, Hierarchical Taxonomy Preparation for Text Categorization Using Consistent Bipartite Spectral Graph Copartitioning, IEEE Transactions on Knowledge and Data Engineering, v.17 n.9, p.1263-1273, September 2005
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Lu Liu , Lifeng Sun , Yong Rui , Yao Shi , Shiqiang Yang, Web video topic discovery and tracking via bipartite graph reinforcement model, Proceeding of the 17th international conference on World Wide Web, April 21-25, 2008, Beijing, China
|
|
|
|
|
|
|
|
|
Jian-Tao Sun , Xuanhui Wang , Dou Shen , Hua-Jun Zeng , Zheng Chen, Mining clickthrough data for collaborative web search, Proceedings of the 15th international conference on World Wide Web, May 23-26, 2006, Edinburgh, Scotland
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Bingjun Sun , Ding Zhou , Hongyuan Zha , John Yen, Multi-task text segmentation and alignment based on weighted mutual information, Proceedings of the 15th ACM international conference on Information and knowledge management, November 06-11, 2006, Arlington, Virginia, USA
|
|
|
Wei Tang , Hui Xiong , Shi Zhong , Jie Wu, Enhancing semi-supervised clustering: a feature projection perspective, Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, August 12-15, 2007, San Jose, California, USA
|
|
|
Bo Long , Zhongfei (Mark) Zhang , Xiaoyun Wú , Philip S. Yu, Spectral clustering for multi-type relational data, Proceedings of the 23rd international conference on Machine learning, p.585-592, June 25-29, 2006, Pittsburgh, Pennsylvania
|
|
|
|
|
|
|
|
|
Bo Long , Xiaoyun Wu , Zhongfei (Mark) Zhang , Philip S. Yu, Unsupervised learning on k-partite graphs, Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, August 20-23, 2006, Philadelphia, PA, USA
|
|
|
Zheng-Yu Niu , Dong-Hong Ji , Chew Lim Tan, A semi-supervised feature clustering algorithm with application to word sense disambiguation, Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, p.907-914, October 06-08, 2005, Vancouver, British Columbia, Canada
|
|
|
|
|
|
Gaurav Pandey , Gowtham Atluri , Michael Steinbach , Chad L. Myers , Vipin Kumar, An association analysis approach to biclustering, Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, June 28-July 01, 2009, Paris, France
|
|
|
|
|
|
|
|
|
Bingjun Sun , Prasenjit Mitra , C. Lee Giles , John Yen , Hongyuan Zha, Topic segmentation with shared topic detection and alignment of multiple documents, Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, July 23-27, 2007, Amsterdam, The Netherlands
|
|
|
Bo Long , Zhongfei (Mark) Zhang , Xiaoyun Wu , Philip S. Yu, Relational clustering by symmetric convex coding, Proceedings of the 24th international conference on Machine learning, p.569-576, June 20-24, 2007, Corvalis, Oregon
|
|
|
|
|
|
|
|
|
|
|
|
Carlotta Domeniconi , Dimitrios Gunopulos , Sheng Ma , Bojun Yan , Muna Al-Razgan , Dimitris Papadopoulos, Locally adaptive metrics for clustering high dimensional data, Data Mining and Knowledge Discovery, v.14 n.1, p.63-97, February 2007
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Wenyuan Dai , Qiang Yang , Gui-Rong Xue , Yong Yu, Self-taught clustering, Proceedings of the 25th international conference on Machine learning, p.200-207, July 05-09, 2008, Helsinki, Finland
|
|
|
|
|
|
|
|
|
|
|
|
Xiao Ling , Gui-Rong Xue , Wenyuan Dai , Yun Jiang , Qiang Yang , Yong Yu, Can chinese web pages be classified with english data source?, Proceeding of the 17th international conference on World Wide Web, April 21-25, 2008, Beijing, China
|
|
|
Jimeng Sun , Christos Faloutsos , Spiros Papadimitriou , Philip S. Yu, GraphScope: parameter-free mining of large time-evolving graphs, Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, August 12-15, 2007, San Jose, California, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Dingding Wang , Shenghuo Zhu , Tao Li , Yun Chi , Yihong Gong, Integrating clustering and multi-document summarization to improve document understanding, Proceeding of the 17th ACM conference on Information and knowledge management, October 26-30, 2008, Napa Valley, California, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Meghana Deodhar , Gunjan Gupta , Joydeep Ghosh , Hyuk Cho , Inderjit Dhillon, A scalable framework for discovering coherent co-clusters in noisy data, Proceedings of the 26th Annual International Conference on Machine Learning, p.241-248, June 14-18, 2009, Montreal, Quebec, Canada
|
|
|
|
|
|
|
|
|
|
|
|
|
|