ACM Home Page
Please provide us with feedback. Feedback
BIRCH: an efficient data clustering method for very large databases
Full text PdfPdf (1.49 MB)
Source International Conference on Management of Data archive
Proceedings of the 1996 ACM SIGMOD international conference on Management of data table of contents
Montreal, Quebec, Canada
Pages: 103 - 114  
Year of Publication: 1996
ISBN:0-89791-794-4
Also published in ...
Authors
Tian Zhang  Computer Sciences Dept., Univ. of Wisconsin-Madison
Raghu Ramakrishnan  Computer Sciences Dept., Univ. of Wisconsin-Madison
Miron Livny  Computer Sciences Dept., Univ. of Wisconsin-Madison
Sponsors
SIGART: ACM Special Interest Group on Artificial Intelligence
SIGMOD: ACM Special Interest Group on Management of Data
SIGACT: ACM Special Interest Group on Algorithms and Computation Theory
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 88,   Downloads (12 Months): 632,   Citation Count: 349
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/233269.233324
What is a DOI?

ABSTRACT

Finding useful patterns in large datasets has attracted considerable interest recently, and one of the most widely studied problems in this area is the identification of clusters, or densely populated regions, in a multi-dimensional dataset. Prior work does not adequately address the problem of large datasets and minimization of I/O costs.This paper presents a data clustering method named BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), and demonstrates that it is especially suitable for very large databases. BIRCH incrementally and dynamically clusters incoming multi-dimensional metric data points to try to produce the best quality clustering with the available resources (i.e., available memory and time constraints). BIRCH can typically find a good clustering with a single scan of the data, and improve the quality further with a few additional scans. BIRCH is also the first clustering algorithm proposed in the database area to handle "noise" (data points that are not part of the underlying pattern) effectively.We evaluate BIRCH's time/space efficiency, data input order sensitivity, and clustering quality through several experiments. We also present a performance comparisons of BIRCH versus CLARANS, a clustering method proposed recently for large datasets, and show that BIRCH is consistently superior.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
CKS88
Peter C, heeselnan, James Kelly, Matthew Self, et al., Auto CIass : A Bayesian Classification System, Proc. of the 5th Int'l Conf. on Machine Learning, Morgan Kaufman, 3un. 1988.
 
DH73
Richard Duds, and Peter E. Hart, Pattern Classification and Scene Analysis, Wiley, 1973.
 
DJ80
R. Dubes, and A.K. Jain, Clustering Methodologies in Exploratory Data Analysis Advances in C, omputers, Edited by M.C. Yovits, Vol. 19, Academic Press, New York, 1980.
 
EKX95a
Martin Ester, Hans-Peter Kriegel, and Xiaowei Xu, A Database Interface .for Clustering in Large Spatial Databases, Proc. of 1st {nt'l Conf. on Knowledge Discovery and Data Mining, 1995.
 
EKX95b
 
Fis87
 
Fis95
Douglas H. Fisher, Iterative Optimization and Simplification of Hierarchical CIuaterings, Technical Report CS-95-01, Dept. of Computer Science, Vanderbilt l lniversity, Nashville, TN 37235.
 
GG92
 
KR90
Leonard Kaufman, and Peter J. Rousseeuw, Finding Groups in Data - An Introduction to Cluster Analysis, Wiley Series in Probability and Mathematical Statistics, 1990.
 
Leb87
 
Lee81
R.C.T.Lee, Clustering analysis and its applications, Advances in Information Systems Science, Edited by J .T.Toum, Vol. 8, pp. 169-292, Plenum Press, New York, 1981.
 
Mur83
F. Murtagh, A Survey of _Recent Advances in Hier'archical Clustering Algorithms, The Computer Journal, 1933.
 
NH94
 
Ols93
 
ZRL95
Tian Zhang, Raghu Ramakrishnan, and Miron Livl~y, BIRCH: An Efficient Data Clustering Method .for Very Large Databases, Technical Report, Computer Sciences Dept., Univ. of Wisconsin-Madison, 1995.

CITED BY  351

Collaborative Colleagues:
Tian Zhang: colleagues
Raghu Ramakrishnan: colleagues
Miron Livny: colleagues