|
ABSTRACT
Clustering, in data mining, is useful for discovering groups and identifying interesting distributions in the underlying data. Traditional clustering algorithms either favor clusters with spherical shapes and similar sizes, or are very fragile in the presence of outliers. We propose a new clustering algorithm called CURE that is more robust to outliers, and identifies clusters having non-spherical shapes and wide variances in size. CURE achieves this by representing each cluster by a certain fixed number of points that are generated by selecting well scattered points from the cluster and then shrinking them toward the center of the cluster by a specified fraction. Having more than one representative point per cluster allows CURE to adjust well to the geometry of non-spherical shapes and the shrinking helps to dampen the effects of outliers. To handle large databases, CURE employs a combination of random sampling and partitioning. A random sample drawn from the data set is first partitioned and each partition is partially clustered. The partial clusters are then clustered in a second pass to yield the desired clusters. Our experimental results confirm that the quality of clusters produced by CURE is much better than those found by existing algorithms. Furthermore, they demonstrate that random sampling and partitioning enable CURE to not only outperform existing algorithms but also to scale well for large databases without sacrificing clustering quality.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
BKSS90
|
Norbert Beckmann , Hans-Peter Kriegel , Ralf Schneider , Bernhard Seeger, The R*-tree: an efficient and robust access method for points and rectangles, Proceedings of the 1990 ACM SIGMOD international conference on Management of data, p.322-331, May 23-26, 1990, Atlantic City, New Jersey, United States
|
| |
CLR90
|
|
| |
EKSX96
|
Martin Ester, Hans-Peter Kriegel, Jorg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial database with noise. In Int'l Conference on Knowledge Discovery in Databases and Data Mining (KDD-96), Portland, Oregon, August 1996.
|
| |
EKX95
|
Martin Ester, Hans-Peter Kriegel, and Xiaowei Xu. A database interface for clustering in large spatial databases. In Int'! Conference on Knowledge Discovery in Databases and Data Mining (KDD-95), Montreal, Canada, August 1995.
|
| |
GRS97
|
Sudipto Guha, R. Rastogi, and K. Shim. CURE: A clustering algorithm for large databases. Technical report, Bell Laboratories, Murray Hill, 1997.
|
| |
JD88
|
|
| |
MR95
|
|
| |
NH94
|
|
| |
Ols93
|
|
| |
Sam89
|
|
| |
Sam90
|
|
| |
SRF87
|
|
 |
Vit85
|
|
 |
ZRL96
|
Tian Zhang , Raghu Ramakrishnan , Miron Livny, BIRCH: an efficient data clustering method for very large databases, Proceedings of the 1996 ACM SIGMOD international conference on Management of data, p.103-114, June 04-06, 1996, Montreal, Quebec, Canada
|
CITED BY 234
|
|
|
|
|
|
|
|
|
|
|
Chun-Hung Cheng , Ada Waichee Fu , Yi Zhang, Entropy-based subspace clustering for mining numerical data, Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, p.84-93, August 15-18, 1999, San Diego, California, United States
|
|
|
|
|
|
|
|
|
Christian Böhm , Bernhard Braunmüller , Markus Breunig , Hans-Peter Kriegel, High performance clustering based on the similarity join, Proceedings of the ninth international conference on Information and knowledge management, p.298-305, November 06-11, 2000, McLean, Virginia, United States
|
|
|
Venkatesh Ganti , Johannes Gehrke , Raghu Ramakrishnan, A framework for measuring changes in data characteristics, Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.126-137, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Minos N. Garofalakis , Rajeev Rastogi , S. Seshadri , Kyuseok Shim, Data mining and the Web: past, present and future, Proceedings of the 2nd international workshop on Web information and data management, p.43-47, November 02-06, 1999, Kansas City, Missouri, United States
|
|
|
Jason Tsong-Li Wang , Xiong Wang , King-Ip Lin , Dennis Shasha , Bruce A. Shapiro , Kaizhong Zhang, Evaluating a class of distance-mapping algorithms for data mining and clustering, Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, p.307-311, August 15-18, 1999, San Diego, California, United States
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Bing Liu , Yiyuan Xia , Philip S. Yu, Clustering through decision tree construction, Proceedings of the ninth international conference on Information and knowledge management, p.20-29, November 06-11, 2000, McLean, Virginia, United States
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Y. Dora Cai , David Clutter , Greg Pape , Jiawei Han , Michael Welge , Loretta Auvil, MAIDS: mining alarming incidents from data streams, Proceedings of the 2004 ACM SIGMOD international conference on Management of data, June 13-18, 2004, Paris, France
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Deepayan Chakrabarti , Spiros Papadimitriou , Dharmendra S. Modha , Christos Faloutsos, Fully automatic cross-associations, Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, August 22-25, 2004, Seattle, WA, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Xin Zhang , Nikos Mamoulis , David W. Cheung , Yutao Shou, Fast mining of spatial collocations, Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, August 22-25, 2004, Seattle, WA, USA
|
|
|
Edith Cohen , Mayur Datar , Shinji Fujiwara , Aristides Gionis , Piotr Indyk , Rajeev Motwani , Jeffrey D. Ullman , Cheng Yang, Finding Interesting Associations without Support Pruning, IEEE Transactions on Knowledge and Data Engineering, v.13 n.1, p.64-78, January 2001
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Brain Babcock , Mayur Datar , Rajeev Motwani , Liadan O'Callaghan, Maintaining variance and k-medians over data stream windows, Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.234-243, June 09-11, 2003, San Diego, California
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Sudipto Guha , H. V. Jagadish , Nick Koudas , Divesh Srivastava , Ting Yu, Approximate XML joins, Proceedings of the 2002 ACM SIGMOD international conference on Management of data, June 03-06, 2002, Madison, Wisconsin
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Csaba Legány , Sándor Juhász , Attila Babos, Cluster validity measurement techniques, Proceedings of the 5th WSEAS International Conference on Artificial Intelligence, Knowledge Engineering and Data Bases, p.388-393, February 15-17, 2006, Madrid, Spain
|
|
|
|
|
|
|
|
|
|
|
|
Christian Böhm , Christos Faloutsos , Jia-Yu Pan , Claudia Plant, Robust information-theoretic clustering, Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, August 20-23, 2006, Philadelphia, PA, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Christopher R. Lumb , Jiri Schindler , Gregory R. Ganger , David F. Nagle , Erik Riedel, Towards higher disk head utilization: extracting free bandwidth from busy disk drives, Proceedings of the 4th conference on Symposium on Operating System Design & Implementation, p.7-7, October 22-25, 2000, San Diego, California
|
|
|
|
|
|
|
|
|
Ronald N. Kostoff , J. Antonio Del Río , Héctor D. Cortés , Charles Smith , Andrew Smith , Caroline Wagner , Loet Leydesdorff , George Karypis , Guido Malpohl , Rene Tshiteya, Clustering methodologies for identifying country core competencies, Journal of Information Science, v.33 n.1, p.21-40, February 2007
|
|
|
|
|
|
|
|
|
Charu C. Aggarwal , Jiawei Han , Jianyong Wang , Philip S. Yu, A framework for projected clustering of high dimensional data streams, Proceedings of the Thirtieth international conference on Very large data bases, p.852-863, August 31-September 03, 2004, Toronto, Canada
|
|
|
Charu C. Aggarwal , Jiawei Han , Jianyong Wang , Philip S. Yu, A framework for clustering evolving data streams, Proceedings of the 29th international conference on Very large data bases, p.81-92, September 09-12, 2003, Berlin, Germany
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Maria Camila N. Barioni , Humberto L. Razente , Agma J. M. Traina , Caetano Traina, Jr., Accelerating k-medoid-based algorithms through metric access methods, Journal of Systems and Software, v.81 n.3, p.343-355, March, 2008
|
|
|
|
|
|
|
|
|
Jren-Chit Chin , David K.Y. Yau , Nageswara S.V. Rao , Yong Yang , Chris Y.T. Ma , Mallikarjun Shankar, Accurate localization of low-level radioactive source under noise and measurement errors, Proceedings of the 6th ACM conference on Embedded network sensor systems, November 05-07, 2008, Raleigh, NC, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Guang-Hui Yan , Li-Song Liu , Lin-Na Du , Xia-Xia Yang , Zhi-Cheng Ma , Xiao-Min Zhang, Multifractal-based cluster hierarchy optimisation algorithm, International Journal of Business Intelligence and Data Mining, v.3 n.4, p.353-374, January 2009
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Yihong Dong , Shaoka Cao , Ken Chen , Maoshun He , Xiaoying Tai, PFHC: A clustering algorithm based on data partitioning for unevenly distributed datasets, Fuzzy Sets and Systems, v.160 n.13, p.1886-1901, July, 2009
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Erendira Rendón , Rene Garcia , Itzel Abundez , Citlalih Gutierrez , Eduardo Gasca , Federico Del Razo , Adrian Gonzalez, NIVA: a Robust cluster validity, Proceedings of the 12th WSEAS international conference on Communications, p.241-248, July 23-25, 2008, Heraklion, Greece
|
|
|
|
|
|
Hui Xiong , Junjie Wu , Jian Chen, K-means clustering versus validation measures: a data-distribution perspective, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, v.39 n.2, p.318-331, April 2009
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|