ACM Home Page
Please provide us with feedback. Feedback
A divide-and-merge methodology for clustering
Full text PdfPdf (1.66 MB)
Source ACM Transactions on Database Systems (TODS) archive
Volume 31 ,  Issue 4  (December 2006) table of contents
Pages: 1499 - 1525  
Year of Publication: 2006
ISSN:0362-5915
Authors
David Cheng  Massachusetts Institute of Technology, Cambridge, MA
Ravi Kannan  Yale University, New Haven, CT
Santosh Vempala  Massachusetts Institute of Technology, Cambridge, MA
Grant Wang  Massachusetts Institute of Technology, Cambridge, MA
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 26,   Downloads (12 Months): 229,   Citation Count: 8
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1189769.1189779
What is a DOI?

ABSTRACT

We present a divide-and-merge methodology for clustering a set of objects that combines a top-down “divide” phase with a bottom-up “merge” phase. In contrast, previous algorithms use either top-down or bottom-up methods to construct a hierarchical clustering or produce a flat clustering using local search (e.g., k-means). For the divide phase, which produces a tree whose leaves are the elements of the set, we suggest an efficient spectral algorithm. When the data is in the form of a sparse document-term matrix, we show how to modify the algorithm so that it maintains sparsity and runs in linear space. The merge phase quickly finds the optimal partition that respects the tree for many natural objective functions, for example, k-means, min-diameter, min-sum, correlation clustering, etc. We present a thorough experimental evaluation of the methodology. We describe the implementation of a meta-search engine that uses this methodology to cluster results from web searches. We also give comparative empirical results on several real datasets.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
Anderberg, M. 1973. Cluster Analysis for Applications. Academic Press, New York, NY.
 
2
Andritsos, P., Tsaparas, P., Miller, R. J., and Sevcik, K. C. 2004. Limbo: Scalable clustering of categorical data. In Procedings of the International Conference on Extending Database Technology.
 
3
4
5
 
6
 
7
8
9
10
 
11
Demaine, E. D. and Immorlica, N. 2003. Correlation clustering with partial information. In Proceedings of the 6th International Workshop on Approximation Algorithms for Combinatorial Optimization Problems. 1--13.
12
 
13
Emanuel, D. and Fiat, A. 2003. Correlation clustering---minimizing disagreements on arbitrary weighted graphs. In Proceedings of the 11th European Symposium on Algorithms. 208--220.
 
14
 
15
Golub, T. R. Golub leukemia data. http://www.broad.mit.edu/cgi-bin/cancer/publications/pub_paper.cgi?mode=view&paper_id=43.
 
16
Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D., and Lander, E. S. 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531--537.
 
17
 
18
Hartigan, J. A. and Wong, M. A. 1979. A k-means clustering algorithm. In Applied Statistics 28, 1, 100--108.
 
19
 
20
21
22
 
23
24
 
25
Nickerson, A., Japkowicz, N., and Milios, E. 2001. Using unsupervised learning to guide re-sampling in imbalanced data sets. In Proceedings of the 8th International Workshop on AI and Statistics. 261--265.
 
26
27
 
28
Shannon, C. E. 1948. A mathematical theory of communication. Bell Syst. Tech. J. 27, 379--423.
 
29
30
 
31
Steinbach, M., Karypis, G., and Kumar, V. 2000. A comparison of document clustering techniques. In Proceedings of the KDD Workshop on Text Mining.
 
32
 
33
Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E., and Golub, T. 1999. Interpreting patterns of gene expression with self-organizing maps; methods and application to hematopoietic differentiation. Proc. Nat. Acad. Sci. 96, 2907--2912.
 
34
Theiler, J. and Gisler, G. 1997. A contiguity-enhanced k-means clustering algorithm for unsupervised multispectral image segmentation. In Proceedings of the Society of Optical Engineering. 108--111.
 
35
 
36
Wong, W. and Fu, A. 2000. Incremental document clustering for Web page classification. In Proceedings of the IEEE International Conference on Information Society in the 21st Century: Emerging Technologies and New Challenges.
 
37
Zamir, O., Etzioni, O., Madani, O., and Karp, R. M. 1997. Fast and intuitive clustering of Web documents. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining. 287--290.
 
38
Zha, H., Ding, C., Gu, M., He, X., and Simon, H. 2001. Spectral relaxation for k-means clustering. In Proceedings of the Conference on Neural Information Processing Systems. 1057--1064.
39

CITED BY  9

Collaborative Colleagues:
David Cheng: colleagues
Ravi Kannan: colleagues
Santosh Vempala: colleagues
Grant Wang: colleagues