|
ABSTRACT
Large, sparse binary matrices arise in numerous data mining applications, such as the analysis of market baskets, web graphs, social networks, co-citations, as well as information retrieval, collaborative filtering, sparse matrix reordering, etc. Virtually all popular methods for the analysis of such matrices---e.g., k-means clustering, METIS graph partitioning, SVD/PCA and frequent itemset mining---require the user to specify various parameters, such as the number of clusters, number of principal components, number of partitions, and "support." Choosing suitable values for such parameters is a challenging problem.Cross-association is a joint decomposition of a binary matrix into disjoint row and column groups such that the rectangular intersections of groups are homogeneous. Starting from first principles, we furnish a clear, information-theoretic criterion to choose a good cross-association as well as its parameters, namely, the number of row and column groups. We provide scalable algorithms to approach the optimal. Our algorithm is parameter-free, and requires no user intervention. In practice it scales linearly with the problem size, and is thus applicable to very large matrices. Finally, we present experiments on multiple synthetic and real-life datasets, where our method gives high-quality, intuitive results.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
G. Hamerly and C. Elkan, "Learning the k in k-means," in Proc. 17th NIPS, 2003.
|
| |
3
|
|
| |
4
|
|
 |
5
|
Sudipto Guha , Rajeev Rastogi , Kyuseok Shim, CURE: an efficient clustering algorithm for large databases, Proceedings of the 1998 ACM SIGMOD international conference on Management of data, p.73-84, June 01-04, 1998, Seattle, Washington, United States
|
 |
6
|
Tian Zhang , Raghu Ramakrishnan , Miron Livny, BIRCH: an efficient data clustering method for very large databases, Proceedings of the 1996 ACM SIGMOD international conference on Management of data, p.103-114, June 04-06, 1996, Montreal, Quebec, Canada
|
| |
7
|
|
| |
8
|
A. Hinneburg and D. A. Keim, "An efficient approach to clustering in large multimedia databases with noise," in Proc. 4th KDD, pp. 58--65, 1998.
|
| |
9
|
|
 |
10
|
|
| |
11
|
M. M. Madiman, M. Harrison, and I. Kontoyiannis, "A minimum description length proposal for lossy data compression," in Proc. IEEE ISIT, 2004.
|
| |
12
|
|
| |
13
|
|
| |
14
|
|
 |
15
|
|
| |
16
|
S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, "Indexing by latent semantic analysis," JASI, vol. 41, pp. 391--407, 1990.
|
 |
17
|
|
 |
18
|
|
 |
19
|
Christos H. Papadimitriou , Hisao Tamaki , Prabhakar Raghavan , Santosh Vempala, Latent semantic indexing: a probabilistic analysis, Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, p.159-168, June 01-04, 1998, Seattle, Washington, United States
[doi> 10.1145/275487.275505]
|
| |
20
|
|
| |
21
|
A. Y. Ng, M. I. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm," in Proc. NIPS, pp. 849--856, 2001.
|
| |
22
|
N. Mishra, D. Ron, and R. Swaminathan, "On finding large conjunctive clusters," in Proc. 16th COLT, 2003.
|
| |
23
|
|
 |
24
|
|
| |
25
|
J. Rissanen, "Modeling by shortest data description," Automatica, vol. 14, pp. 465--471, 1978.
|
| |
26
|
J. Rissanen, "Generalized Kraft inequality and arithmetic coding," IBM J. Res. Dev., vol. 20, no. 3, pp. 198--203, 1976.
|
| |
27
|
J. Rissanen and G. G. Langdon Jr., "Arithmetic coding," IBM J. Res. Dev., vol. 23, pp. 149--162, 1979.
|
 |
28
|
|
| |
29
|
J. Rissanen, "Universal prior for integers and estimation by minimum description length," Annals of Statistics, vol. 11, no. 2, pp. 416--431, 1983.
|
| |
30
|
|
| |
31
|
M. Richardson, R. Agrawal, and P. Domingos, "Trust management for the semantic web," in Proc. 2nd ISWC, pp. 351--368, 2003.
|
| |
32
|
|
CITED BY 18
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Christian Böhm , Christos Faloutsos , Jia-Yu Pan , Claudia Plant, Robust information-theoretic clustering, Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, August 20-23, 2006, Philadelphia, PA, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Eamonn Keogh , Stefano Lonardi , Chotirat Ann Ratanamahatana , Li Wei , Sang-Hee Lee , John Handley, Compression-based data mining of sequential data, Data Mining and Knowledge Discovery, v.14 n.1, p.99-129, February 2007
|
|
|
Jimeng Sun , Christos Faloutsos , Spiros Papadimitriou , Philip S. Yu, GraphScope: parameter-free mining of large time-evolving graphs, Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, August 12-15, 2007, San Jose, California, USA
|
|
|
|
|
|
|
|
|
|
|
|
Ruoming Jin , Scott McCallen , Yuri Breitbart , Dave Fuhry , Dong Wang, Estimating the number of frequent itemsets in a large database, Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, March 24-26, 2009, Saint Petersburg, Russia
|
|
|
|
|
|
Wei Lu , Mahbod Tavallaee , Ali A. Ghorbani, Automatic discovery of botnet communities on large-scale communication networks, Proceedings of the 4th International Symposium on Information, Computer, and Communications Security, March 10-12, 2009, Sydney, Australia
|
|
|
|
|