ACM Home Page
Please provide us with feedback. Feedback
Fully automatic cross-associations
Full text PdfPdf (2.32 MB)
Source International Conference on Knowledge Discovery and Data Mining archive
Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining table of contents
Seattle, WA, USA
SESSION: Research track papers table of contents
Pages: 79 - 88  
Year of Publication: 2004
ISBN:1-58113-888-1
Authors
Deepayan Chakrabarti  Carnegie Mellon University, Pittsburgh, PA
Spiros Papadimitriou  Carnegie Mellon University, Pittsburgh, PA
Dharmendra S. Modha  IBM Almaden Research Center, San Jose, CA
Christos Faloutsos  Carnegie Mellon University, Pittsburgh, PA
Sponsors
SIGMOD: ACM Special Interest Group on Management of Data
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 19,   Downloads (12 Months): 93,   Citation Count: 18
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1014052.1014064
What is a DOI?

ABSTRACT

Large, sparse binary matrices arise in numerous data mining applications, such as the analysis of market baskets, web graphs, social networks, co-citations, as well as information retrieval, collaborative filtering, sparse matrix reordering, etc. Virtually all popular methods for the analysis of such matrices---e.g., k-means clustering, METIS graph partitioning, SVD/PCA and frequent itemset mining---require the user to specify various parameters, such as the number of clusters, number of principal components, number of partitions, and "support." Choosing suitable values for such parameters is a challenging problem.Cross-association is a joint decomposition of a binary matrix into disjoint row and column groups such that the rectangular intersections of groups are homogeneous. Starting from first principles, we furnish a clear, information-theoretic criterion to choose a good cross-association as well as its parameters, namely, the number of row and column groups. We provide scalable algorithms to approach the optimal. Our algorithm is parameter-free, and requires no user intervention. In practice it scales linearly with the problem size, and is thus applicable to very large matrices. Finally, we present experiments on multiple synthetic and real-life datasets, where our method gives high-quality, intuitive results.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
 
2
G. Hamerly and C. Elkan, "Learning the k in k-means," in Proc. 17th NIPS, 2003.
 
3
 
4
5
6
 
7
 
8
A. Hinneburg and D. A. Keim, "An efficient approach to clustering in large multimedia databases with noise," in Proc. 4th KDD, pp. 58--65, 1998.
 
9
10
 
11
M. M. Madiman, M. Harrison, and I. Kontoyiannis, "A minimum description length proposal for lossy data compression," in Proc. IEEE ISIT, 2004.
 
12
 
13
 
14
15
 
16
S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, "Indexing by latent semantic analysis," JASI, vol. 41, pp. 391--407, 1990.
17
18
19
 
20
 
21
A. Y. Ng, M. I. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm," in Proc. NIPS, pp. 849--856, 2001.
 
22
N. Mishra, D. Ron, and R. Swaminathan, "On finding large conjunctive clusters," in Proc. 16th COLT, 2003.
 
23
24
 
25
J. Rissanen, "Modeling by shortest data description," Automatica, vol. 14, pp. 465--471, 1978.
 
26
J. Rissanen, "Generalized Kraft inequality and arithmetic coding," IBM J. Res. Dev., vol. 20, no. 3, pp. 198--203, 1976.
 
27
J. Rissanen and G. G. Langdon Jr., "Arithmetic coding," IBM J. Res. Dev., vol. 23, pp. 149--162, 1979.
28
 
29
J. Rissanen, "Universal prior for integers and estimation by minimum description length," Annals of Statistics, vol. 11, no. 2, pp. 416--431, 1983.
 
30
 
31
M. Richardson, R. Agrawal, and P. Domingos, "Trust management for the semantic web," in Proc. 2nd ISWC, pp. 351--368, 2003.
 
32

CITED BY  18

Collaborative Colleagues:
Deepayan Chakrabarti: colleagues
Spiros Papadimitriou: colleagues
Dharmendra S. Modha: colleagues
Christos Faloutsos: colleagues