ACM Home Page
Please provide us with feedback. Feedback
RIC: Parameter-free noise-robust clustering
Full text PdfPdf (1.05 MB)
Source
ACM Transactions on Knowledge Discovery from Data (TKDD) archive
Volume 1 ,  Issue 3  (December 2007) table of contents
Article No. 10  
Year of Publication: 2007
ISSN:1556-4681
Authors
Christian Böhm  University of Munich, Munich, Germany
Christos Faloutsos  Carnegie Mellon University, Pittsburgh, PA
Jia-Yu Pan  Google, Mountain View, CA
Claudia Plant  University of Munich, Munich, Germany
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 14,   Downloads (12 Months): 156,   Citation Count: 0
Additional Information:

abstract   references   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1297332.1297334
What is a DOI?

ABSTRACT

How do we find a natural clustering of a real-world point set which contains an unknown number of clusters with different shapes, and which may be contaminated by noise? As most clustering algorithms were designed with certain assumptions (Gaussianity), they often require the user to give input parameters, and are sensitive to noise. In this article, we propose a robust framework for determining a natural clustering of a given dataset, based on the minimum description length (MDL) principle. The proposed framework, robust information-theoretic clustering (RIC), is orthogonal to any known clustering algorithm: Given a preliminary clustering, RIC purifies these clusters from noise, and adjusts the clusterings such that it simultaneously determines the most natural amount and shape (subspace) of the clusters. Our RIC method can be combined with any clustering technique ranging from K-means and K-medoids to advanced methods such as spectral clustering. In fact, RIC is even able to purify and improve an initial coarse clustering, even if we start with very simple methods. In an extension, we propose a fully automatic stand-alone clustering method and efficiency improvements. RIC scales well with the dataset size. Extensive experiments on synthetic and real-world datasets validate the proposed RIC framework.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
2
3
 
4
Banfield, J. D. and Raftery, A. E. 1993. Model-based Gaussian and non-Gaussian clustering. Biometrics 49, 3, 803--821.
 
5
6
7
 
8
Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the ACM SIGKDD Conference on International Knowledge Discovery and Data Mining.
 
9
Grünwald, P. 2005. A tutorial introduction to the minimum description length principle. Advances in Minimum Description Length: Theory and Applications.
10
 
11
Hamerly, G. and Elkan, C. 2003. Learning the k in k-means. In Proceedings of the Conference on Advances in Neural Information Processing Systems (NIPS).
 
12
 
13
Jolliffe, I. 1986. Principal Component Analysis. Springer.
 
14
Murtagh, F. 1983. A survey of recent advances in hierarchical clustering algorithms. Comput. J. 26, 4, 354--359.
 
15
Ng, A., Jordan, M., and Weiss, Y. 2001. On spectral clustering: Analysis and an algorithm. In Proceedings of the Conference on Advances in Neural Information Processing Systems.
 
16
 
17
18
 
19
 
20
Tibshirani, R., Walther, G., and Hastie, T. 2000. Estimating the number of clusters in a dataset via the gap statistic. Tech. Rep., Stanford University.
 
21
Tishby, N., Pereira, F. C., and Bialek, W. 2000. The information bottleneck method. In Proceedings of the 37th Allerton Conference on Communication, Control and Computing.
22
 
23
 
24
25

Collaborative Colleagues:
Christian Böhm: colleagues
Christos Faloutsos: colleagues
Jia-Yu Pan: colleagues
Claudia Plant: colleagues