ACM Home Page
Please provide us with feedback. Feedback
Digital Library logoTake a look at the new version of this page: [ beta version ]. Tell us what you think.
CRD: fast co-clustering on large datasets utilizing sampling-based matrix decomposition
Full text PdfPdf (439 KB)
Source
International Conference on Management of Data archive
Proceedings of the 2008 ACM SIGMOD international conference on Management of data table of contents
Vancouver, Canada
SESSION: Research Session 5: Clustering in High Dimensions table of contents
Pages: 173-184  
Year of Publication: 2008
ISBN:978-1-60558-102-6
Authors
Feng Pan  University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Xiang Zhang  University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Wei Wang  University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Sponsors
ACM: Association for Computing Machinery
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 19,   Downloads (12 Months): 207,   Citation Count: 0
Additional Information:

abstract   references   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1376616.1376637
What is a DOI?

ABSTRACT

The problem of simultaneously clustering columns and rows (co-clustering) arises in important applications, such as text data mining, microarray analysis, and recommendation system analysis. Compared with the classical clustering algorithms, co-clustering algorithms have been shown to be more effective in discovering hidden clustering structures in the data matrix. The complexity of previous co-clustering algorithms is usually O(m X n), where m and n are the numbers of rows and columns in the data matrix respectively. This limits their applicability to data matrices involving a large number of columns and rows. Moreover, some huge datasets can not be entirely held in main memory during co-clustering which violates the assumption made by the previous algorithms. In this paper, we propose a general framework for fast co-clustering large datasets, CRD. By utilizing recently developed sampling-based matrix decomposition methods, CRD achieves an execution time linear in m and n. Also, CRD does not require the whole data matrix be in the main memory. We conducted extensive experiments on both real and synthetic data. Compared with previous co-clustering algorithms, CRD achieves competitive accuracy but with much less computational cost.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
2
 
3
 
4
S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391--407, 1990.
5
 
6
C. Ding, X. He, and H. Simon. On the equivalence of nonnegative matrix factorization and. spectral clustering. In SDM, 2005.
 
7
8
 
9
10
 
11
G. Golub and A. Loan. Matrix computations. Johns Hopkins University Press, Baltimore, Maryland, 1996.
 
12
J. A. Hartigan. Direct clustering of a data matrix. Journal of the American Statistical Association, 67(337):123--129, 1972.
 
13
D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788--791, 1999.
14
15
 
16
G. Natsoulis and et. al. Classification of a large microarray data set: Algorithm comparison and analysis of drug signatures. Genome Res., 15:724--736, 2005.
17
 
18
J. Sun, Y. Xie, H. Zhang, and C. Faloutsos. Less is more: Compact matrix decomposition for large sparse graphs. In SDM, 2007.
19
 
20

Collaborative Colleagues:
Feng Pan: colleagues
Xiang Zhang: colleagues
Wei Wang: colleagues