|
ABSTRACT
We present a divide-and-merge methodology for clustering a set of objects that combines a top-down “divide” phase with a bottom-up “merge” phase. In contrast, previous algorithms use either top-down or bottom-up methods to construct a hierarchical clustering or produce a flat clustering using local search (e.g., k-means). For the divide phase, which produces a tree whose leaves are the elements of the set, we suggest an efficient spectral algorithm. When the data is in the form of a sparse document-term matrix, we show how to modify the algorithm so that it maintains sparsity and runs in linear space. The merge phase quickly finds the optimal partition that respects the tree for many natural objective functions, for example, k-means, min-diameter, min-sum, correlation clustering, etc. We present a thorough experimental evaluation of the methodology. We describe the implementation of a meta-search engine that uses this methodology to cluster results from web searches. We also give comparative empirical results on several real datasets.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
Anderberg, M. 1973. Cluster Analysis for Applications. Academic Press, New York, NY.
|
| |
2
|
Andritsos, P., Tsaparas, P., Miller, R. J., and Sevcik, K. C. 2004. Limbo: Scalable clustering of categorical data. In Procedings of the International Conference on Extending Database Technology.
|
| |
3
|
|
 |
4
|
|
 |
5
|
|
| |
6
|
|
| |
7
|
|
 |
8
|
Moses Charikar , Chandra Chekuri , Tomás Feder , Rajeev Motwani, Incremental clustering and dynamic information retrieval, Proceedings of the twenty-ninth annual ACM symposium on Theory of computing, p.626-635, May 04-06, 1997, El Paso, Texas, United States
[doi> 10.1145/258533.258657]
|
 |
9
|
Douglass R. Cutting , David R. Karger , Jan O. Pedersen , John W. Tukey, Scatter/Gather: a cluster-based approach to browsing large document collections, Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval, p.318-329, June 21-24, 1992, Copenhagen, Denmark
[doi> 10.1145/133160.133214]
|
 |
10
|
W. Fernandez de la Vega , Marek Karpinski , Claire Kenyon , Yuval Rabani, Approximation schemes for clustering problems, Proceedings of the thirty-fifth annual ACM symposium on Theory of computing, June 09-11, 2003, San Diego, CA, USA
[doi> 10.1145/780542.780550]
|
| |
11
|
Demaine, E. D. and Immorlica, N. 2003. Correlation clustering with partial information. In Proceedings of the 6th International Workshop on Approximation Algorithms for Combinatorial Optimization Problems. 1--13.
|
 |
12
|
|
| |
13
|
Emanuel, D. and Fiat, A. 2003. Correlation clustering---minimizing disagreements on arbitrary weighted graphs. In Proceedings of the 11th European Symposium on Algorithms. 208--220.
|
| |
14
|
|
| |
15
|
Golub, T. R. Golub leukemia data. http://www.broad.mit.edu/cgi-bin/cancer/publications/pub_paper.cgi?mode=view&paper_id=43.
|
| |
16
|
Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D., and Lander, E. S. 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531--537.
|
| |
17
|
|
| |
18
|
Hartigan, J. A. and Wong, M. A. 1979. A k-means clustering algorithm. In Applied Statistics 28, 1, 100--108.
|
| |
19
|
|
| |
20
|
|
 |
21
|
|
 |
22
|
|
| |
23
|
|
 |
24
|
|
| |
25
|
Nickerson, A., Japkowicz, N., and Milios, E. 2001. Using unsupervised learning to guide re-sampling in imbalanced data sets. In Proceedings of the 8th International Workshop on AI and Statistics. 261--265.
|
| |
26
|
|
 |
27
|
|
| |
28
|
Shannon, C. E. 1948. A mathematical theory of communication. Bell Syst. Tech. J. 27, 379--423.
|
| |
29
|
|
 |
30
|
|
| |
31
|
Steinbach, M., Karypis, G., and Kumar, V. 2000. A comparison of document clustering techniques. In Proceedings of the KDD Workshop on Text Mining.
|
| |
32
|
|
| |
33
|
Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E., and Golub, T. 1999. Interpreting patterns of gene expression with self-organizing maps; methods and application to hematopoietic differentiation. Proc. Nat. Acad. Sci. 96, 2907--2912.
|
| |
34
|
Theiler, J. and Gisler, G. 1997. A contiguity-enhanced k-means clustering algorithm for unsupervised multispectral image segmentation. In Proceedings of the Society of Optical Engineering. 108--111.
|
| |
35
|
|
| |
36
|
Wong, W. and Fu, A. 2000. Incremental document clustering for Web page classification. In Proceedings of the IEEE International Conference on Information Society in the 21st Century: Emerging Technologies and New Challenges.
|
| |
37
|
Zamir, O., Etzioni, O., Madani, O., and Karp, R. M. 1997. Fast and intuitive clustering of Web documents. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining. 287--290.
|
| |
38
|
Zha, H., Ding, C., Gu, M., He, X., and Simon, H. 2001. Spectral relaxation for k-means clustering. In Proceedings of the Conference on Neural Information Processing Systems. 1057--1064.
|
 |
39
|
|
|