ACM Home Page
Please provide us with feedback. Feedback
Clustering aggregation
Full text PdfPdf (627 KB)
Source
ACM Transactions on Knowledge Discovery from Data (TKDD) archive
Volume 1 ,  Issue 1  (March 2007) table of contents
Article No. 4  
Year of Publication: 2007
ISSN:1556-4681
Authors
Aristides Gionis  Yahoo! Research Labs, Barcelona, Spain
Heikki Mannila  University of Helsinki and Helsinki University of Technology
Panayiotis Tsaparas  Microsoft Search Labs
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 53,   Downloads (12 Months): 487,   Citation Count: 3
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1217299.1217303
What is a DOI?

ABSTRACT

We consider the following problem: given a set of clusterings, find a single clustering that agrees as much as possible with the input clusterings. This problem, clustering aggregation, appears naturally in various contexts. For example, clustering categorical data is an instance of the clustering aggregation problem; each categorical attribute can be viewed as a clustering of the input rows where rows are grouped together if they take the same value on that attribute. Clustering aggregation can also be used as a metaclustering method to improve the robustness of clustering by combining the output of multiple algorithms. Furthermore, the problem formulation does not require a priori information about the number of clusters; it is naturally determined by the optimization function.

In this article, we give a formal statement of the clustering aggregation problem, and we propose a number of algorithms. Our algorithms make use of the connection between clustering aggregation and the problem of correlation clustering. Although the problems we consider are NP-hard, for several of our methods, we provide theoretical guarantees on the quality of the solutions. Our work provides the best deterministic approximation algorithm for the variation of the correlation clustering problem we consider. We also show how sampling can be used to scale the algorithms for large datasets. We give an extensive empirical evaluation demonstrating the usefulness of the problem and of the solutions.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
 
2
Andritsos, P., Tsaparas, P., Miller, R. J., and Sevcik, K. C. 2004. LIMBO: Scalable clustering of categorical data. In Proceedings of the International Conference on Extending Database Technology (EDBT). 123--146.
 
3
 
4
Barthelemy, J.-P. and Leclerc, B. 1995. The median procedure for partitions. DIMACS Series in Discrete Mathematics, 3--34.
 
5
Blake, C. L. and Merz, C. J. 1998. UCI repository of machine learning databases.
 
6
 
7
 
8
Cristofor, D. and Simovici, D. A. 2001. An information-theoretical approach to genetic algorithms for clustering. Tech. rep. TR-01-02, UMass, Boston, MA.
 
9
 
10
Deza, M. and Laurent, M. 1997. Geometry of Cuts and Metrics. Springer-Verlag.
11
 
12
 
13
Fern, X. Z. and Brodley, C. E. 2003. Random projection for high dimensional data clustering: A cluster ensemble approach. In Proceedings of the International Conference on Machine Learning (ICML). 186--193.
 
14
Filkov, V. and Skiena, S. 2004. Integrating microarray data by consensus clustering. Int. J. AI Tools 13, 4, 863--880.
 
15
 
16
 
17
Hamerly, G. and Elkan, C. 2003. Learning the k in k-means. In Advances in Neural Information Processing Systems (NIPS).
 
18
 
19
 
20
Hochbaum, D. and Shmoys, D. 1985. A best possible heuristic for the k-center problem. Mathem. Operat. Resea., 180--184.
 
21
22
 
23
Schwarz, G. 1978. Estimating Dimension of a Model. Ann. Statis. 6, 461--464.
 
24
 
25
 
26
 
27
Topchy, A., Jain, A. K., and Punch, W. 2004. A mixture model of clustering ensembles. In Proceedings of the SIAM International Conference on Data Mining (SDM). 379--390.


Collaborative Colleagues:
Aristides Gionis: colleagues
Heikki Mannila: colleagues
Panayiotis Tsaparas: colleagues