|
ABSTRACT
In this paper we explore the connection between clustering categorical data and entropy: clusters of similar poi lower entropy than those of dissimilar ones. We use this connection to design an incremental heuristic algorithm, COOLCAT, which is capable of efficiently clustering large data sets of records with categorical attributes, and data streams. In contrast with other categorical clustering algorithms published in the past, COOLCAT's clustering results are very stable for different sample sizes and parameter settings. Also, the criteria for clustering is a very intuitive one, since it is deeply rooted on the well-known notion of entropy. Most importantly, COOLCAT is well equipped to deal with clustering of data streams(continuously arriving streams of data point) since it is an incremental algorithm capable of clustering new points without having to look at every point that has been clustered so far. We demonstrate the efficiency and scalability of COOLCAT by a series of experiments on real and synthetic data sets.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
M.S. Aldenderfer and R.K. Blashfield. Cluster Analysis Sage Publications,(Sage University Paper series on Quantitative Applications in the Social Sciences, No. 44), 1984.
|
 |
2
|
|
 |
3
|
|
| |
4
|
R.B. Calinski and J. Harabasz.A dendrite method for cluster analysis. Communications in Statistics pages 1--27,1974.
|
| |
5
|
|
 |
6
|
Chun-Hung Cheng , Ada Waichee Fu , Yi Zhang, Entropy-based subspace clustering for mining numerical data, Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, p.84-93, August 15-18, 1999, San Diego, California, United States
[doi> 10.1145/312129.312199]
|
| |
7
|
H.Chernoff A Measure of Asymptotic Efficiency for Tests of a Hypothesis Based on the Sum of Observations. Annals of Mathematical Statistics pages 493--509, 1952.
|
| |
8
|
DataGen. Data Generator: Perfect data for an imperfect world. http://www.datasetgenerator.com/.
|
| |
9
|
R. C. Dubes and A.K. Jain. Validity studies in clustering methodologies.Pattern Recognition pages 235--254, 1979.
|
| |
10
|
|
| |
11
|
M. Ester, H.P. Kriegel, and X. Wu. A density-based algorithm for discovering clusters in large spatial database with noise. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, Portland, Oregon August 1996.
|
 |
12
|
Venkatesh Ganti , Johannes Gehrke , Raghu Ramakrishnan, CACTUS—clustering categorical data using summaries, Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, p.73-83, August 15-18, 1999, San Diego, California, United States
[doi> 10.1145/312129.312201]
|
| |
13
|
|
| |
14
|
|
| |
15
|
A. Gluck and J. Corter. Information, uncertainty, and the utility of categories. In Proceedings of the Seventh Annual Conference of the Cognitive Science Society 1985.
|
| |
16
|
|
 |
17
|
Sudipto Guha , Rajeev Rastogi , Kyuseok Shim, CURE: an efficient clustering algorithm for large databases, Proceedings of the 1998 ACM SIGMOD international conference on Management of data, p.73-84, June 01-04, 1998, Seattle, Washington, United States
|
| |
18
|
|
| |
19
|
E.H. Han, G. Karypis, V.Kumar,and B. Mobasher. Clustering based on association rule hypergraphs. In Proceedings of the SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery June 1997.
|
| |
20
|
S. Hettich(librarian). UCI KDD Archive. http://kdd.ics.uci.edu/.
|
| |
21
|
|
| |
22
|
G.J McLachlan and K.E.Basford.Mixture Models Marcel Dekker,New York,1988.
|
| |
23
|
|
| |
24
|
J.C. Pincipe, D. Xu, and J. Fisher. Information theoretic learning. In S. Haykin, editor, Unsupervised Adaptive Filtering John Wiley & Sons, 2000.
|
| |
25
|
A. Renyi. On Measures of Entropy and Information. In Proc. of the Fourth Berkeley Symp. Math., Statistics, and Probability 1960.
|
| |
26
|
J. Rissanen. A universal prior for integers and estimation by minimum description length. The Annals of Statistics 1983.
|
| |
27
|
|
| |
28
|
C. E. Shannon. A mathematical theory of communication. Bell System Techical Journal pages 379--423, 1948.
|
| |
29
|
C. S. Wallace and D. M. Boulton. An information measure for classification.The Computer Journal 11(2), 1968.
|
| |
30
|
C. S. Wallace and D. L. Dowe. Intrinsic classification by MML, the Snob program. In Proceedings of the 7th Australian Joint Conference on Artificial Intelligence 1994.
|
 |
31
|
Tian Zhang , Raghu Ramakrishnan , Miron Livny, BIRCH: an efficient data clustering method for very large databases, Proceedings of the 1996 ACM SIGMOD international conference on Management of data, p.103-114, June 04-06, 1996, Montreal, Quebec, Canada
|
CITED BY 25
|
|
Daniel Barbará , Yi Li , Julia Couto , Jia-Ling Lin , Sushil Jajodia, Bootstrapping a data mining intrusion detection system, Proceedings of the 2003 ACM symposium on Applied computing, March 09-12, 2003, Melbourne, Florida
|
|
|
Tao Li , Sheng Ma , Mitsunori Ogihara, Entropy-based criterion in categorical clustering, Proceedings of the twenty-first international conference on Machine learning, p.68, July 04-08, 2004, Banff, Alberta, Canada
|
|
|
|
|
|
|
|
|
|
|
|
Ronald Fagin , R. Guha , Ravi Kumar , Jasmine Novak , D. Sivakumar , Andrew Tomkins, Multi-structural databases, Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 13-15, 2005, Baltimore, Maryland
|
|
|
Mohammed J. Zaki , Markus Peters , Ira Assent , Thomas Seidl, CLICKS: an effective algorithm for mining subspace clusters in categorical datasets, Proceeding of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, August 21-24, 2005, Chicago, Illinois, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Hui Xiong , Junjie Wu , Jian Chen, K-means clustering versus validation measures: a data-distribution perspective, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, v.39 n.2, p.318-331, April 2009
|
|
|
|
|