| CRD: fast co-clustering on large datasets utilizing sampling-based matrix decomposition |
| Full text |
Pdf
(439 KB)
|
Source
|
International Conference on Management of Data
archive
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
table of contents
Vancouver, Canada
SESSION: Research Session 5: Clustering in High Dimensions
table of contents
Pages: 173-184
Year of Publication: 2008
ISBN:978-1-60558-102-6
|
|
Authors
|
|
Feng Pan
|
University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
|
|
Xiang Zhang
|
University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
|
|
Wei Wang
|
University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
|
|
| Sponsors |
|
| Publisher |
|
| Bibliometrics |
Downloads (6 Weeks): 19, Downloads (12 Months): 207, Citation Count: 0
|
|
|
ABSTRACT
The problem of simultaneously clustering columns and rows (co-clustering) arises in important applications, such as text data mining, microarray analysis, and recommendation system analysis. Compared with the classical clustering algorithms, co-clustering algorithms have been shown to be more effective in discovering hidden clustering structures in the data matrix. The complexity of previous co-clustering algorithms is usually O(m X n), where m and n are the numbers of rows and columns in the data matrix respectively. This limits their applicability to data matrices involving a large number of columns and rows. Moreover, some huge datasets can not be entirely held in main memory during co-clustering which violates the assumption made by the previous algorithms. In this paper, we propose a general framework for fast co-clustering large datasets, CRD. By utilizing recently developed sampling-based matrix decomposition methods, CRD achieves an execution time linear in m and n. Also, CRD does not require the whole data matrix be in the main memory. We conducted extensive experiments on both real and synthetic data. Compared with previous co-clustering algorithms, CRD achieves competitive accuracy but with much less computational cost.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
Arindam Banerjee , Inderjit Dhillon , Joydeep Ghosh , Srujana Merugu , Dharmendra S. Modha, A generalized maximum entropy approach to bregman co-clustering and matrix approximation, Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, August 22-25, 2004, Seattle, WA, USA
[doi> 10.1145/1014052.1014111]
|
 |
2
|
Deepayan Chakrabarti , Spiros Papadimitriou , Dharmendra S. Modha , Christos Faloutsos, Fully automatic cross-associations, Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, August 22-25, 2004, Seattle, WA, USA
[doi> 10.1145/1014052.1014064]
|
| |
3
|
|
| |
4
|
S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391--407, 1990.
|
 |
5
|
|
| |
6
|
C. Ding, X. He, and H. Simon. On the equivalence of nonnegative matrix factorization and. spectral clustering. In SDM, 2005.
|
| |
7
|
|
 |
8
|
Chris Ding , Tao Li , Wei Peng , Haesun Park, Orthogonal nonnegative matrix t-factorizations for clustering, Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, August 20-23, 2006, Philadelphia, PA, USA
[doi> 10.1145/1150402.1150420]
|
| |
9
|
|
 |
10
|
|
| |
11
|
G. Golub and A. Loan. Matrix computations. Johns Hopkins University Press, Baltimore, Maryland, 1996.
|
| |
12
|
J. A. Hartigan. Direct clustering of a data matrix. Journal of the American Statistical Association, 67(337):123--129, 1972.
|
| |
13
|
D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788--791, 1999.
|
 |
14
|
|
 |
15
|
|
| |
16
|
G. Natsoulis and et. al. Classification of a large microarray data set: Algorithm comparison and analysis of drug signatures. Genome Res., 15:724--736, 2005.
|
 |
17
|
|
| |
18
|
J. Sun, Y. Xie, H. Zhang, and C. Faloutsos. Less is more: Compact matrix decomposition for large sparse graphs. In SDM, 2007.
|
 |
19
|
|
| |
20
|
|
|