COOLCAT: an entropy-based algorithm for categorical clustering
Source: Conference on Information and Knowledge Management
Proceedings of the eleventh international conference on Information and knowledge management
McLean, Virginia, USA
SESSION: Clustering algorithms
Pages: 582-589
Year of Publication: 2002
ISBN: 1-58113-492-4
Authors
Daniel Barbará, George Mason University, Fairfax, VA 22030
Yi Li, George Mason University, Fairfax, VA
Julia Couto, James Madison University, Harrisonburg, VA
Sponsors
SIGMIS: ACM Special Interest Group on Management Information Systems
ACM: Association for Computing Machinery
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/584792.584888

ABSTRACT

In this paper we explore the connection between clustering categorical data and entropy: clusters of similar points have lower entropy than those of dissimilar ones. We use this connection to design an incremental heuristic algorithm, COOLCAT, which is capable of efficiently clustering large data sets of records with categorical attributes, as well as data streams. In contrast with other categorical clustering algorithms published in the past, COOLCAT's clustering results are very stable across different sample sizes and parameter settings. Also, the criterion for clustering is a very intuitive one, since it is rooted in the well-known notion of entropy. Most importantly, COOLCAT is well equipped to deal with clustering of data streams (continuously arriving streams of data points), since it is an incremental algorithm capable of clustering new points without having to look at every point that has been clustered so far. We demonstrate the efficiency and scalability of COOLCAT through a series of experiments on real and synthetic data sets.
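
The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the formulas, so the sketch assumes that attributes are independent (cluster entropy is the sum of per-attribute Shannon entropies) and that a new point goes to the cluster that minimizes the size-weighted expected entropy. The names `ClusterSummary` and `assign` are hypothetical. Each cluster keeps only value counts, so a new point is placed without revisiting earlier points.

```python
import math
from collections import Counter


class ClusterSummary:
    """Per-attribute value counts for one cluster (hypothetical helper)."""

    def __init__(self, num_attributes):
        self.size = 0
        self.counts = [Counter() for _ in range(num_attributes)]

    def add(self, record):
        """Update the counts with one categorical record."""
        self.size += 1
        for counter, value in zip(self.counts, record):
            counter[value] += 1

    def entropy(self, extra=None):
        """Entropy in bits, optionally as if `extra` were also a member.

        Assumption: attributes are independent, so the cluster entropy
        is the sum of the per-attribute Shannon entropies.
        """
        n = self.size + (1 if extra is not None else 0)
        if n == 0:
            return 0.0
        total = 0.0
        for j, counter in enumerate(self.counts):
            values = dict(counter)
            if extra is not None:
                values[extra[j]] = values.get(extra[j], 0) + 1
            for count in values.values():
                p = count / n
                total -= p * math.log2(p)
        return total


def assign(clusters, record):
    """Add `record` to the cluster giving the lowest expected entropy.

    Expected entropy is assumed to be the size-weighted mean of the
    cluster entropies. Returns the index of the chosen cluster.
    """
    n_total = sum(c.size for c in clusters) + 1
    weighted = [c.size * c.entropy() for c in clusters]
    weighted_sum = sum(weighted)
    best_index, best_score = 0, math.inf
    for i, c in enumerate(clusters):
        trial = (c.size + 1) * c.entropy(extra=record)
        score = (weighted_sum - weighted[i] + trial) / n_total
        if score < best_score:
            best_index, best_score = i, score
    clusters[best_index].add(record)
    return best_index


# Two clusters, each seeded with one record of two categorical attributes.
clusters = [ClusterSummary(2), ClusterSummary(2)]
clusters[0].add(("red", "small"))
clusters[1].add(("blue", "large"))
chosen = assign(clusters, ("red", "small"))  # joins the matching cluster
```

How the initial clusters are seeded and how points are ordered are left out here, since the abstract does not describe them.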


