|
ABSTRACT
Finding useful patterns in large datasets has attracted considerable interest recently, and one of the most widely studied problems in this area is the identification of clusters, or densely populated regions, in a multi-dimensional dataset. Prior work does not adequately address the problem of large datasets and minimization of I/O costs.This paper presents a data clustering method named BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), and demonstrates that it is especially suitable for very large databases. BIRCH incrementally and dynamically clusters incoming multi-dimensional metric data points to try to produce the best quality clustering with the available resources (i.e., available memory and time constraints). BIRCH can typically find a good clustering with a single scan of the data, and improve the quality further with a few additional scans. BIRCH is also the first clustering algorithm proposed in the database area to handle "noise" (data points that are not part of the underlying pattern) effectively.We evaluate BIRCH's time/space efficiency, data input order sensitivity, and clustering quality through several experiments. We also present a performance comparisons of BIRCH versus CLARANS, a clustering method proposed recently for large datasets, and show that BIRCH is consistently superior.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
CKS88
|
Peter C, heeselnan, James Kelly, Matthew Self, et al., Auto CIass : A Bayesian Classification System, Proc. of the 5th Int'l Conf. on Machine Learning, Morgan Kaufman, 3un. 1988.
|
| |
DH73
|
Richard Duds, and Peter E. Hart, Pattern Classification and Scene Analysis, Wiley, 1973.
|
| |
DJ80
|
R. Dubes, and A.K. Jain, Clustering Methodologies in Exploratory Data Analysis Advances in C, omputers, Edited by M.C. Yovits, Vol. 19, Academic Press, New York, 1980.
|
| |
EKX95a
|
Martin Ester, Hans-Peter Kriegel, and Xiaowei Xu, A Database Interface .for Clustering in Large Spatial Databases, Proc. of 1st {nt'l Conf. on Knowledge Discovery and Data Mining, 1995.
|
| |
EKX95b
|
|
| |
Fis87
|
|
| |
Fis95
|
Douglas H. Fisher, Iterative Optimization and Simplification of Hierarchical CIuaterings, Technical Report CS-95-01, Dept. of Computer Science, Vanderbilt l lniversity, Nashville, TN 37235.
|
| |
GG92
|
|
| |
KR90
|
Leonard Kaufman, and Peter J. Rousseeuw, Finding Groups in Data - An Introduction to Cluster Analysis, Wiley Series in Probability and Mathematical Statistics, 1990.
|
| |
Leb87
|
|
| |
Lee81
|
R.C.T.Lee, Clustering analysis and its applications, Advances in Information Systems Science, Edited by J .T.Toum, Vol. 8, pp. 169-292, Plenum Press, New York, 1981.
|
| |
Mur83
|
F. Murtagh, A Survey of _Recent Advances in Hier'archical Clustering Algorithms, The Computer Journal, 1933.
|
| |
NH94
|
|
| |
Ols93
|
|
| |
ZRL95
|
Tian Zhang, Raghu Ramakrishnan, and Miron Livl~y, BIRCH: An Efficient Data Clustering Method .for Very Large Databases, Technical Report, Computer Sciences Dept., Univ. of Wisconsin-Madison, 1995.
|
CITED BY 351
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Paul B. Chou , Edna Grossman , Dimitrios Gunopulos , Pasumarti Kamesam, Identifying prospective customers, Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, p.447-456, August 20-23, 2000, Boston, Massachusetts, United States
|
|
|
|
|
|
|
|
|
Ke Wang , Chu Xu , Bing Liu, Clustering transactions using large items, Proceedings of the eighth international conference on Information and knowledge management, p.483-490, November 02-06, 1999, Kansas City, Missouri, United States
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Venkatesh Ganti , Johannes Gehrke , Raghu Ramakrishnan, A framework for measuring changes in data characteristics, Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.126-137, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
|
|
|
|
|
|
Jason Tsong-Li Wang , Xiong Wang , King-Ip Lin , Dennis Shasha , Bruce A. Shapiro , Kaizhong Zhang, Evaluating a class of distance-mapping algorithms for data mining and clustering, Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, p.307-311, August 15-18, 1999, San Diego, California, United States
|
|
|
|
|
|
|
|
|
|
|
|
Brain Babcock , Mayur Datar , Rajeev Motwani , Liadan O'Callaghan, Maintaining variance and k-medians over data stream windows, Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.234-243, June 09-11, 2003, San Diego, California
|
|
|
Minos N. Garofalakis , Rajeev Rastogi , S. Seshadri , Kyuseok Shim, Data mining and the Web: past, present and future, Proceedings of the 2nd international workshop on Web information and data management, p.43-47, November 02-06, 1999, Kansas City, Missouri, United States
|
|
|
Andrew McCallum , Kamal Nigam , Lyle H. Ungar, Efficient clustering of high-dimensional data sets with application to reference matching, Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, p.169-178, August 20-23, 2000, Boston, Massachusetts, United States
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Chun-Hung Cheng , Ada Waichee Fu , Yi Zhang, Entropy-based subspace clustering for mining numerical data, Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, p.84-93, August 15-18, 1999, San Diego, California, United States
|
|
|
|
|
|
|
|
|
Jayavel Shanmugasundaram , Usama Fayyad , P. S. Bradley, Compressed data cubes for OLAP aggregate query approximation on continuous dimensions, Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, p.223-232, August 15-18, 1999, San Diego, California, United States
|
|
|
Agma Traina , Caetano Traina , Spiros Papadimitriou , Christos Faloutsos, Tri-plots: scalable tools for multidimensional data mining, Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, p.184-193, August 26-29, 2001, San Francisco, California
|
|
|
|
|
|
|
|
|
|
|
|
Doug Warner , J. Neal Richter , Stephen D. Durbin , Bikramjit Banerjee, Mining user session data to facilitate user interaction with a customer service knowledge base in RightNow Web, Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, p.467-472, August 26-29, 2001, San Francisco, California
|
|
|
|
|
|
|
|
|
M. Livny , R. Ramakrishnan , K. Beyer , G. Chen , D. Donjerkovic , S. Lawande , J. Myllymaki , K. Wenger, DEVise: integrated querying and visual exploration of large datasets, ACM SIGMOD Record, v.26 n.2, p.301-312, June 1997
|
|
|
In-Soo Kang , Tae-wan Kim , Ki-Joune Li, A spatial data mining method by Delaunay triangulation, Proceedings of the 5th ACM international workshop on Advances in geographic information systems, p.35-39, November 10-14, 1997, Las Vegas, Nevada, United States
|
|
|
|
|
|
Eun-Jeong Son , In-Soo Kang , Tae-Wan Kim , Ki-Joune Li, A spatial data mining method by clustering analysis, Proceedings of the 6th ACM international symposium on Advances in geographic information systems, p.157-158, November 02-07, 1998, Washington, D.C., United States
|
|
|
|
|
|
|
|
|
|
|
|
M. Livny , R. Ramakrishnan , K. Beyer , G. Chen , D. Donjerkovic , S. Lawande , J. Myllymaki , K. Wenger, DEVise (demo abstract): integrated querying and visual exploration of large datasets, ACM SIGMOD Record, v.26 n.2, p.517-520, June 1997
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Y. Dora Cai , David Clutter , Greg Pape , Jiawei Han , Michael Welge , Loretta Auvil, MAIDS: mining alarming incidents from data streams, Proceedings of the 2004 ACM SIGMOD international conference on Management of data, June 13-18, 2004, Paris, France
|
|
|
|
|
|
Tian Zhang , Raghu Ramakrishnan , Miron Livny, Fast density estimation using CF-kernel for very large databases, Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, p.312-316, August 15-18, 1999, San Diego, California, United States
|
|
|
|
|
|
Bing Liu , Yiyuan Xia , Philip S. Yu, Clustering through decision tree construction, Proceedings of the ninth international conference on Information and knowledge management, p.20-29, November 06-11, 2000, McLean, Virginia, United States
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Charu C. Aggarwal , Jiawei Han , Jianyong Wang , Philip S. Yu, On demand classification of data streams, Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, August 22-25, 2004, Seattle, WA, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Deepayan Chakrabarti , Spiros Papadimitriou , Dharmendra S. Modha , Christos Faloutsos, Fully automatic cross-associations, Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, August 22-25, 2004, Seattle, WA, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Stephen D. Durbin , Doug Warner , J. Neal Richter , Zuzana Gedeon, Rightnow eservice center: internet customer service using a self-learning knowledge base, Eighteenth national conference on Artificial intelligence, p.815-821, July 28-August 01, 2002, Edmonton, Alberta, Canada
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Sudipto Guha , H. V. Jagadish , Nick Koudas , Divesh Srivastava , Ting Yu, Approximate XML joins, Proceedings of the 2002 ACM SIGMOD international conference on Management of data, June 03-06, 2002, Madison, Wisconsin
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Christian Böhm , Christos Faloutsos , Jia-Yu Pan , Claudia Plant, Robust information-theoretic clustering, Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, August 20-23, 2006, Philadelphia, PA, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Yi Zhuang , Yueting Zhuang , Qing Li , Lei Chen, Towards interactive indexing for large Chinese calligraphic character databases, Proceedings of the 15th ACM international conference on Information and knowledge management, November 06-11, 2006, Arlington, Virginia, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
P. Compieta , S. Di Martino , M. Bertolotto , F. Ferrucci , T. Kechadi, Exploratory spatio-temporal data mining and visualization, Journal of Visual Languages and Computing, v.18 n.3, p.255-279, June, 2007
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
David Johnson , Shankar Krishnan , Jatin Chhugani , Subodh Kumar , Suresh Venkatasubramanian, Compressing large boolean matrices using reordering techniques, Proceedings of the Thirtieth international conference on Very large data bases, p.13-23, August 31-September 03, 2004, Toronto, Canada
|
|
|
Charu C. Aggarwal , Jiawei Han , Jianyong Wang , Philip S. Yu, A framework for projected clustering of high dimensional data streams, Proceedings of the Thirtieth international conference on Very large data bases, p.852-863, August 31-September 03, 2004, Toronto, Canada
|
|
|
Charu C. Aggarwal , Jiawei Han , Jianyong Wang , Philip S. Yu, A framework for clustering evolving data streams, Proceedings of the 29th international conference on Very large data bases, p.81-92, September 09-12, 2003, Berlin, Germany
|
|
|
|
|
|
|
|
|
|
|
|
Antara Ghosh , Jignashu Parikh , Vibhuti S. Sengar , Jayant R. Haritsa, Plan selection based on query clustering, Proceedings of the 28th international conference on Very Large Data Bases, p.179-190, August 20-23, 2002, Hong Kong, China
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Bin Li , Mingmin Chi , Jianping Fan , Xiangyang Xue, Support cluster machine, Proceedings of the 24th international conference on Machine learning, p.505-512, June 20-24, 2007, Corvalis, Oregon
|
|
|
|
|
|
|
|
|
Rashmin Babaria , J. Saketha Nath , Krishnan S , Sivaramakrishnan K R , Chiranjib Bhattacharyya , M. N. Murty, Focused crawling with scalable ordinal regression solvers, Proceedings of the 24th international conference on Machine learning, p.57-64, June 20-24, 2007, Corvalis, Oregon
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Carlotta Domeniconi , Dimitrios Gunopulos , Sheng Ma , Bojun Yan , Muna Al-Razgan , Dimitris Papadopoulos, Locally adaptive metrics for clustering high dimensional data, Data Mining and Knowledge Discovery, v.14 n.1, p.63-97, February 2007
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
David Madigan , Nandini Raghavan , William Dumouchel , Martha Nason , Christian Posse , Greg Ridgeway, Likelihood-Based Data Squashing: A Modeling Approach to Instance Construction, Data Mining and Knowledge Discovery, v.6 n.2, p.173-190, April 2002
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Charu C. Aggarwal , Na Ta , Jianyong Wang , Jianhua Feng , Mohammed Zaki, Xproj: a framework for projected structural clustering of xml documents, Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, August 12-15, 2007, San Jose, California, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Rong Ge , Martin Ester , Wen Jin , Ian Davidson, Constraint-driven clustering, Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, August 12-15, 2007, San Jose, California, USA
|
|
|
Isabel Praça , Maria João Viamonte , Zita Vale , Carlos Ramos, Agent-based simulation of electronic marketplaces with decision support, Proceedings of the 2008 ACM symposium on Applied computing, March 16-20, 2008, Fortaleza, Ceara, Brazil
|
|
|
|
|
|
|
|
|
|
|
|
Maria Camila N. Barioni , Humberto L. Razente , Agma J. M. Traina , Caetano Traina, Jr., Accelerating k-medoid-based algorithms through metric access methods, Journal of Systems and Software, v.81 n.3, p.343-355, March, 2008
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Huanhuan Cao , Daxin Jiang , Jian Pei , Qi He , Zhen Liao , Enhong Chen , Hang Li, Context-aware query suggestion by mining click-through and session data, Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, August 24-27, 2008, Las Vegas, Nevada, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Guang-Hui Yan , Li-Song Liu , Lin-Na Du , Xia-Xia Yang , Zhi-Cheng Ma , Xiao-Min Zhang, Multifractal-based cluster hierarchy optimisation algorithm, International Journal of Business Intelligence and Data Mining, v.3 n.4, p.353-374, January 2009
|
|
|
|
|
|
|
|
|
Peter Mork , Ken Smith , Barbara Blaustein , Chris Wolf , Keri Sarver, Facilitating discovery on the private web using dataset digests, Proceedings of the 10th International Conference on Information Integration and Web-based Applications & Services, November 24-26, 2008, Linz, Austria
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Yihong Dong , Shaoka Cao , Ken Chen , Maoshun He , Xiaoying Tai, PFHC: A clustering algorithm based on data partitioning for unevenly distributed datasets, Fuzzy Sets and Systems, v.160 n.13, p.1886-1901, July, 2009
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Hui Xiong , Junjie Wu , Jian Chen, K-means clustering versus validation measures: a data-distribution perspective, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, v.39 n.2, p.318-331, April 2009
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Ting Wang , Mudhakar Srivatsa , Dakshi Agrawal , Ling Liu, Learning, indexing, and diagnosing network faults, Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, June 28-July 01, 2009, Paris, France
|
|
|
|
|
|
Stephen D. Durbin , Doug Warner , J. Neal Richter , Zuzana Gedeon, Rightnow eservice center: internet customer service using a self-learning knowledge base, Proceedings of the 14th conference on Innovative applications of artificial intelligence, p.815-821, July 28-August 01, 2002, Edmonton, Alberta, Canada
|
|
|
|
|