| Data bubbles: quality preserving performance boosting for hierarchical clustering |
| Full text |
Pdf
(397 KB)
|
| Source
|
International Conference on Management of Data
archive
Proceedings of the 2001 ACM SIGMOD international conference on Management of data
table of contents
Santa Barbara, California, United States
Pages: 79 - 90
Year of Publication: 2001
ISBN:1-58113-332-4
Also published in ...
|
|
Authors
|
|
Markus M. Breunig
|
Institute for Computer Science, University of Munich, Oettingenstr. 67, D-80538 Munich, Germany
|
|
Hans-Peter Kriegel
|
Institute for Computer Science, University of Munich, Oettingenstr. 67, D-80538 Munich, Germany
|
|
Peer Kröger
|
Institute for Computer Science, University of Munich, Oettingenstr. 67, D-80538 Munich, Germany
|
|
Jörg Sander
|
Department of Computer Science, University of British Columbia, Vancouver, BC V6T 1Z4 Canada
|
|
| Sponsor |
|
| Publisher |
|
| Bibliometrics |
Downloads (6 Weeks): 7, Downloads (12 Months): 72, Citation Count: 12
|
|
|
ABSTRACT
In this paper, we investigate how to scale hierarchical clustering methods (such as OPTICS) to extremely large databases by utilizing data compression methods (such as BIRCH or random sampling). We propose a three step procedure: 1) compress the data into suitable representative objects; 2) apply the hierarchical clustering algorithm only to these objects; 3) recover the clustering structure for the whole data set, based on the result for the compressed data. The key issue in this approach is to design compressed data items such that not only a hierarchical clustering algorithm can be applied, but also that they contain enough information to infer the clustering structure of the original data set in the third step. This is crucial because the results of hierarchical clustering algorithms, when applied naively to a random sample or to the clustering features (CFs) generated by BIRCH, deteriorate rapidly for higher compression rates. This is due to three key problems, which we identify. To solve these problems, we propose an efficient post-processing step and the concept of a Data Bubble as a special kind of compressed data item. Applying OPTICS to these Data Bubbles allows us to recover a very accurate approximation of the clustering structure of a large data set even for very high compression rates. A comprehensive performance and quality evaluation shows that we only trade very little quality of the clustering result for a great increase in performance.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
Mihael Ankerst , Markus M. Breunig , Hans-Peter Kriegel , Jörg Sander, OPTICS: ordering points to identify the clustering structure, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p.49-60, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
|
| |
2
|
Bradley P. S., Fayyad U., Reina C.: "Scaling Clustering Algorithms to Large Databases", Proc. 4th Int. Conf. on Knowledge Discovery and Data Mining, New York, NY, AAAI Press, 1998, pp. 9-15.
|
| |
3
|
|
 |
4
|
William DuMouchel , Chris Volinsky , Theodore Johnson , Corinna Cortes , Daryl Pregibon, Squashing flat files flatter, Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, p.6-15, August 15-18, 1999, San Diego, California, United States
[doi> 10.1145/312129.312184]
|
| |
5
|
Ester M., Kriegel H.-P., Sander J., Xu X.: "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise", Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, Portland, OR, AAAI Press, 1996, pp. 226-231.
|
| |
6
|
|
| |
7
|
Kaufman L., Rousseeuw P. J.: "Finding Groups in Data: An Introduction to Cluster Analysis",John Wiley &Sons,1990.
|
| |
8
|
MacQueen J.: "Some Methods for Classification and Analysis of Multivariate Observations", Proc. 5th Berkeley Symp. Math. Statist. Prob., 1967, Vol. 1, pp. 281-297.
|
| |
9
|
Sibson R.: "SLINK: an optimally efficient algorithm for the single-link cluster method", the Computer Journal Vol. 16, No. 1, 1973, pp. 30-34.
|
 |
10
|
Tian Zhang , Raghu Ramakrishnan , Miron Livny, BIRCH: an efficient data clustering method for very large databases, Proceedings of the 1996 ACM SIGMOD international conference on Management of data, p.103-114, June 04-06, 1996, Montreal, Quebec, Canada
|
CITED BY 12
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Salvatore Rinzivillo , Dino Pedreschi , Mirco Nanni , Fosca Giannotti , Natalia Andrienko , Gennady Andrienko, Visually driven analysis of movement data by progressive clustering, Information Visualization, v.7 n.3, p.225-239, June 2008
|
|