| Incremental and effective data summarization for dynamic hierarchical clustering |
| Full text |
Pdf
(235 KB)
|
| Source
|
International Conference on Management of Data
archive
Proceedings of the 2004 ACM SIGMOD international conference on Management of data
table of contents
Paris, France
SESSION: Research sessions: clustering
table of contents
Pages: 467 - 478
Year of Publication: 2004
ISBN:1-58113-859-8
|
|
Authors
|
|
Samer Nassar
|
University of Alberta, Edmonton, Alberta, Canada
|
|
Jörg Sander
|
University of Alberta, Edmonton, Alberta, Canada
|
|
Corrine Cheng
|
University of Alberta, Edmonton, Alberta, Canada
|
|
| Sponsor |
|
| Publisher |
|
| Bibliometrics |
Downloads (6 Weeks): 17, Downloads (12 Months): 129, Citation Count: 1
|
|
|
Warning: The download time has expired please click on the item to try again.
ABSTRACT
Mining informative patterns from very large, dynamically changing databases poses numerous interesting challenges. Data summarizations (e.g., data bubbles) have been proposed to compress very large static databases into representative points suitable for subsequent effective hierarchical cluster analysis. In many real world applications, however, the databases dynamically change due to frequent insertions and deletions, possibly changing the data distribution and clustering structure over time. Completely reapplying both the data summarization and the clustering algorithm to detect the changes in the clustering structure and update the uncovered data patterns following such deletions and insertions is prohibitively expensive for large fast changing databases. In this paper, we propose a new scheme to maintain data bubbles incrementally. By using incremental data bubbles, a high-quality hierarchical clustering is quickly available at any point in time. In our scheme, a quality measure for incremental data bubbles is used to identify data bubbles that do not compress well their underlying data points after certain insertions and deletions. Only these data bubbles are re-built using efficient split and merge operations. An extensive experimental evaluation shows that the incremental data bubbles provide significantly faster data summarization than completely re-building the data bubbles after a certain number of insertions and deletions, and are effective in preserving (and in some cases even improving) the quality of the data summarization.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
Aggarwal, C. C., Han, J., Wang, J., Yu. P. S. A Framework for Clustering Evolving Data Streams. In VLDB'03, 81--92, 2003.
|
 |
2
|
Mihael Ankerst , Markus M. Breunig , Hans-Peter Kriegel , Jörg Sander, OPTICS: ordering points to identify the clustering structure, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p.49-60, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
|
| |
3
|
|
 |
4
|
|
 |
5
|
Markus M. Breunig , Hans-Peter Kriegel , Peer Kröger , Jörg Sander, Data bubbles: quality preserving performance boosting for hierarchical clustering, Proceedings of the 2001 ACM SIGMOD international conference on Management of data, p.79-90, May 21-24, 2001, Santa Barbara, California, United States
|
 |
6
|
Moses Charikar , Chandra Chekuri , Tomás Feder , Rajeev Motwani, Incremental clustering and dynamic information retrieval, Proceedings of the twenty-ninth annual ACM symposium on Theory of computing, p.626-635, May 04-06, 1997, El Paso, Texas, United States
[doi> 10.1145/258533.258657]
|
| |
7
|
|
| |
8
|
Elkan, C. Using the Triangle Inequality to Accelerate k-means. In ICML'03, 2003.
|
| |
9
|
Ester, M., Kriegel, H-P., Sander, J., Xu, X. A Density Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. KDD'96, 226--231, 1996.
|
| |
10
|
|
| |
11
|
|
 |
12
|
|
 |
13
|
|
| |
14
|
MacQueen, J. Some Methods for Classification and Analysis of Multivariate Observations. In 5th Berkeley Symp. Math. Statist. Prob., 281--297, 1967.
|
| |
15
|
|
| |
16
|
Sander, J., Qin, X., Lu. Z., Niu, N. Kovarsky, A. Automated Extraction of Clusters from Hierarchical Clustering Representations. PAKDD'03.
|
| |
17
|
Sibson, R. SLINK: An Optimally Efficient Algorithm for the Single-link Cluster Method. The Computer Journal, 16(1):30--34, 1973.
|
| |
18
|
Triola, M. F., Goodman, W. M., Law, R. Elementary Statistics First Canadian Edition, Addison-Wesley, Don Mills, Ontario, 1999.
|
| |
19
|
|
 |
20
|
Tian Zhang , Raghu Ramakrishnan , Miron Livny, BIRCH: an efficient data clustering method for very large databases, Proceedings of the 1996 ACM SIGMOD international conference on Management of data, p.103-114, June 04-06, 1996, Montreal, Quebec, Canada
|
|