ACM Home Page
Please provide us with feedback. Feedback
Digital Library logoTake a look at the new version of this page: [ beta version ]. Tell us what you think.
Data clustering: a review
Full text PdfPdf (636 KB)
Source ACM Computing Surveys (CSUR) archive
Volume 31 ,  Issue 3  (September 1999) table of contents
Pages: 264 - 323  
Year of Publication: 1999
ISSN:0360-0300
Authors
A. K. Jain  Michigan State Univ., East Lansing
M. N. Murty  Indian Institute of Science, Bangalore, India
P. J. Flynn  Ohio State Univ., Columbus
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 407,   Downloads (12 Months): 3431,   Citation Count: 606
Additional Information:

abstract   references   cited by   index terms   review   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/331499.331504
What is a DOI?

ABSTRACT

Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters). The clustering problem has been addressed in many contexts and by researchers in many disciplines; this reflects its broad appeal and usefulness as one of the steps in exploratory data analysis. However, clustering is a difficult problem combinatorially, and differences in assumptions and contexts in different communities has made the transfer of useful generic concepts and methodologies slow to occur. This paper presents an overview of pattern clustering methods from a statistical pattern recognition perspective, with a goal of providing useful advice and references to fundamental concepts accessible to the broad community of clustering practitioners. We present a taxonomy of clustering techniques, and identify cross-cutting themes and recent advances. We also describe some important applications of clustering algorithms such as image segmentation, object recognition, and information retrieval.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
 
2
ACM, 1994. ACM CR Classifications. ACM Computing Surveys 35, 5-16.
 
3
AL-SULTAN, K.S. 1995. A tabu search approach to clustering problems. Pattern Recogn. 28, 1443-1451.
 
4
 
5
ALLEN, P. A. AND ALLEN, J. R. 1990. Basin Analysis: Principles and Applications. Blackwell Scientific Publications, Inc., Cambridge, MA.
 
6
ALTA VISTA, 1999. http://altavista.digital.com.
 
7
 
8
ANDERBERG, M. R. 1973. Cluster Analysis for Applications. Academic Press, Inc., New York, NY.
9
 
10
 
11
BABU, G. P. AND MURTY, M.N. 1994. Clustering with evolution strategies. Pattern Recogn. 27, 321-329.
 
12
BABU, G. P., MURTY, M. N., AND KEERTHI, S. S. 2000. Stochastic connectionist approach for pattern clustering (To appear). IEEE Trans. Syst. Man Cybern.
 
13
BACKER, F. B. AND HUBERT, L.g. 1976. A graphtheoretic approach to goodness-of-fit in complete-link hierarchical clustering. J. Am. Stat. Assoc. 71,870-878.
 
14
 
15
 
16
 
17
BALL, G. H. AND HALL, D.J. 1965. ISODATA, a novel method of data analysis and classification. Tech. Rep. Stanford University, Stanford, CA.
 
18
BENTLEY, J. L. AND FRIEDMAN, J.H. 1978. Fast algorithms for constructing minimal spanning trees in coordinate spaces. IEEE Trans. Comput. C-27, 6 (June), 97-105.
 
19
 
20
BHUYAN, J. N., RAGHAVAN, V. V., AND VENKATESH, K.E. 1991. Genetic algorithm for clustering with an ordered representation. In Proceedings of the Fourth International Conference on Genetic Algorithms, 408-415.
 
21
BISWAS, G., WEINBERG, J., AND LI, C. 1995. A Conceptual Clustering Method for Knowledge Discovery in Databases. Editions Technip.
 
22
 
23
BRODATZ, P. 1966. Textures: A Photographic Album for Artists and Designers. Dover Publications, Inc., Mineola, NY.
24
 
25
CARPENTER, G. AND GROSSBERG, S. 1990. ART3: Hierarchical search using chemical transmitters in self-organizing pattern recognition architectures. Neural Networks 3, 129-152.
 
26
CHEKURI, C., GOLDWASSER, M. H., RAGHAVAN, P., AND UPFAL, E. 1997. Web search using automatic classification. In Proceedings of the Sixth International Conference on the World Wide Web (Santa Clara, CA, Apr.), http:// theory.stanford.edu/people/wass/publications/ Web Search/Web Search.html.
 
27
CHENG, C. H. 1995. A branch-and-bound clustering algorithm. IEEE Trans. Syst. Man Cybern. 25, 895-898.
 
28
CHENG, Y. AND FU, K.S. 1985. Conceptual clustering in knowledge organization. IEEE Trans. Pattern Anal. Mach. Intell. 7, 592-598.
 
29
 
30
 
31
 
32
1996. Special issue on data mining. Commun. ACM 39, 11.
 
33
COLEMAN, G. B. AND ANDREWS, H. C. 1979. Image segmentation by clustering. Proc. IEEE 67, 5, 773-785.
 
34
 
35
 
36
DALE, M. B. 1985. On the comparison of conceptual clustering and numerical taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 7, 241-244.
 
37
DAVE, R. N. 1992. Generalized fuzzy C-shells clustering and detection of circular and elliptic boundaries. Pattern Recogn. 25, 713-722.
 
38
DAVIS, T., Ed. 1991. The Handbook of Genetic Algorithms. Van Nostrand Reinhold Co., New York, NY.
 
39
DAY, W. H.E. 1992. Complexity theory: An introduction for practitioners of classification. In Clustering and Classification, P. Arabie and L. Hubert, Eds. World Scientific Publishing Co., Inc., River Edge, NJ.
 
40
DEMPSTER, A. P., LAIRD, N. M., AND RUB IN, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm. J. Royal Stat. Soc. B. 39, 1, 1-38.
 
41
DIDAY, E. 1973. The dynamic cluster method in non-hierarchical clustering. J. Comput. Inf. Sci. 2, 61-88.
 
42
DIDAY, E. AND SIMON, J. C. 1976. Clustering analysis. In Digital Pattern Recognition, K. S. Fu, Ed. Springer-Verlag, Secaucus, NJ, 47-94.
 
43
DIDAY, E. 1988. The symbolic approach in clustering. In Classification and Related Methods, H. H. Bock, Ed. North-Holland Publishing Co., Amsterdam, The Netherlands.
 
44
 
45
DUBES, R. C. AND JAIN, A. K. 1976. Clustering techniques: The user's dilemma. Pattern Recogn. 8, 247-260.
 
46
DUBES, R. C. AND JAIN, A. K. 1980. Clustering methodology in exploratory data analysis. In Advances in Computers, M. C. Yovits,, Ed. Academic Press, Inc., New York, NY, 113- 125.
 
47
 
48
 
49
DUBUISSON, M. P. AND JAIN, A.K. 1994. A modified Hausdorff distance for object matching. In Proceedings of the International Conference on Pattern Recognition (ICPR '94), 566-568.
 
50
 
51
DUNN, S., JANOS, L., AND ROSENFELD, A. 1983. Bimean clustering. Pattern Recogn. Lett. 1, 169-173.
 
52
DURAN, B. S. AND ODELL, P. L. 1974. Cluster Analysis: A Survey. Springer-Verlag, New York, NY.
 
53
54
 
55
EVERITT, B.S. 1993. Cluster Analysis. Edward Arnold, Ltd., London, UK.
 
56
FABER, V. 1994. Clustering and the continuous k-means algorithm. Los Alamos Science 22, 138-144.
 
57
FABER, V., HOCHBERG, J. C., KELLY, P. M., THOMAS, T. R., AND WHITE, J.M. 1994. Concept extraction: A data-mining technique. Los Alamos Science 22, 122-149.
 
58
 
59
 
60
 
61
 
62
FISHER, L. AND VAN NESS, J. W. 1971. Admissible clustering procedures. Biometrika 58, 91-104.
 
63
 
64
FOGEL, D. B. AND SIMPSON, P.K. 1993. Evolving fuzzy clusters. In Proceedings of the International Conference on Neural Networks (San Francisco, CA), 1829-1834.
 
65
FOGEL, D. B. AND FOGEL, L. J., Eds. 1994. Special issue on evolutionary computation. IEEE Trans. Neural Netw. (Jan.).
 
66
FOGEL, L. J., OWENS, A. J., AND WALSH, M. J. 1965. Artificial Intelligence Through Simulated Evolution. John Wiley and Sons, Inc., New York, NY.
 
67
 
68
FRED, A. L. N. AND LEITAO, J. M. N. 1996. A minimum code length technique for clustering of syntactic patterns. In Proceedings of the International Conference on Pattern Recognition (Vienna, Austria), 680-684.
 
69
 
70
Fu, K. S. AND LU, S.Y. 1977. A clustering procedure for syntactic patterns. IEEE Trans. Syst. Man Cybern. 7, 734-742.
 
71
Fu, K. S. AND MUI, J. K. 1981. A survey on image segmentation. Pattern Recogn. 13, 3-16.
 
72
 
73
 
74
 
75
GORDON, A. D. AND HENDERSON, J. T. 1977. Algorithm for Euclidean sum of squares. Biometrics 33, 355-362.
76
 
77
GOWDA, K. C. 1984. A feature reduction and unsupervised classification algorithm for multispectral data. Pattern Recogn. 17, 6, 667- 676.
 
78
GOWDA, K. C. AND KRISHNA, G. 1977. Agglomerative clustering using the concept of mutual nearest neighborhood. Pattern Recogn. 10, 105-112.
 
79
GOWDA, K. C. AND DIDAY, E. 1992. Symbolic clustering using a new dissimilarity meG- sure. IEEE Trans. Syst. Man Cybern. 22, 368-378.
 
80
GOWER, J. C. AND ROSS, G. J.S. 1969. Minimum spanning rees and single-linkage cluster analysis. Appl. Stat. 18, 54-64.
 
81
 
82
HARALICK, R. M. AND KELLY, G. L. 1969. Pattern recognition with measurement space and spatial clustering for multiple images. Proc. IEEE 57, 4, 654-665.
 
83
 
84
 
85
 
86
 
87
 
88
 
89
 
90
 
91
 
92
ICHINO, M. AND YAGUCHI, H. 1994. Generalized Minkowski metrics for mixed feature-type data analysis. IEEE Trans. Syst. Man Cybern. 24, 698-708.
 
93
1991. Proceedings of the International Joint Conference on Neural Networks. (IJCNN'91).
 
94
1992. Proceedings of the International Joint Conference on Neural Networks.
 
95
 
96
 
97
 
98
 
99
 
100
JAIN, A. K. AND MAO, J. 1994. Neural networks and pattern recognition. In Computational Intelligence: Imitating Life, J. M. Zurada, R. J. Marks, and C. J. Robinson, Eds. 194- 212.
 
101
JAIN, A. K. AND FLYNN, P.J. 1996. Image segmentation using clustering. In Advances in Image Understanding: A Festschrift for Azriel Rosenfeld, N. Ahuja and K. Bowyer, Eds, IEEE Press, Piscataway, NJ, 65-83.
 
102
 
103
JAIN, A. K., RATHA, N. K., AND LAKSHMANAN, S. 1997. Object detection using Gabor filters. Pattern Recogn. 30, 2, 295-309.
 
104
 
105
 
106
JARVIS, R. A. AND PATRICK, E. A. 1973. Clustering using a similarity method based on shared near neighbors. IEEE Trans. Comput. C-22, 8 (Aug.), 1025-1034.
 
107
 
108
JONES, D. AND BELTRAMO, M.A. 1991. Solving partitioning problems with genetic algorithms. In Proceedings of the Fourth International Conference on Genetic Algorithms, 442-449.
 
109
 
110
KING, B. 1967. Step-wise clustering procedures. J. Am. Stat. Assoc. 69, 86-101.
 
111
KIRKPATRICK, S., GELATT, C. D., JR., AND VECCHI, M.P. 1983. Optimization by simulated annealing. Science 220, 4598 (May), 671-680.
 
112
KLEIN, R. W. AND DUBES, R. C. 1989. Experiments in projection and clustering by simulated annealing. Pattern Recogn. 22, 213-220.
 
113
 
114
KOONTZ, W. L. G., FUKUNAGA, K., AND NARENDRA, P.M. 1975. A branch and bound clustering algorithm. IEEE Trans. Comput. 23, 908- 914.
 
115
 
116
KRAAIJVELD, M., MAO, J., AND JAIN, A. K. 1995. A non-linear projection method based on Kohonen's topology preserving maps. IEEE Trans. Neural Netw. 6, 548-559.
 
117
KRISHNAPURAM, R., FRIGUI, H., AND NASRAOUI, O. 1995. Fuzzy and probabilistic shell clustering algorithms and their application to boundary detection and surface approximation. IEEE Trans. Fuzzy Systems 3, 29-60.
 
118
 
119
LIBRARY OF CONGRESS, 1990. LC classification outline. Library of Congress, Washington, DC.
 
120
 
121
 
122
LEE, R. C. T., SLAGLE, J. R., AND MONG, C. T. 1978. Towards automatic auditing of records. IEEE Trans. Softw. Eng. 4, 441- 448.
 
123
LEE, R. C. T. 1981. Cluster analysis and its applications. In Advances in Information Systems Science, J. T. Tou, Ed. Plenum Press, New York, NY.
 
124
LI, C. AND BISWAS, G. 1995. Knowledge-based scientific discovery in geological databases. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining (Montreal, Canada, Aug. 20-21), 204 -209.
 
125
Lu, S. Y. AND FU, K. S. 1978. A sentence-tosentence clustering procedure for pattern analysis. IEEE Trans. Syst. Man Cybern. 8, 381-389.
 
126
LUNDERVOLD, A., FENSTAD, A. M., ERSLAND, L., AND TAXT, T. 1996. Brain tissue volumes from multispectral 3D MRI: A comparative study of four classifiers. In Proceedings of the Conference of the Society on Magnetic Resonance,
 
127
 
128
MCQUEEN, J. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 281-297.
 
129
 
130
MAO, J. AND JAIN, A.K. 1995. Artificial neural networks for feature extraction and multivariate data projection. IEEE Trans. Neural Netw. 6, 296-317.
 
131
MAO, J. AND JAIN, A.K. 1996. A self-organizing network for hyperellipsoidal clustering (HEC). IEEE Trans. Neural Netw. 7, 16-29.
 
132
 
133
MICHALSKI, R., STEPP, R. E., AND DIDAY, E. 1981. A recent advance in data analysis: Clustering objects into classes characterized by conjunctive concepts. In Progress in Pattern Recognition, Vol. 1, L. Kanal and A. Rosenfeld, Eds. North-Holland Publishing Co., Amsterdam, The Netherlands.
 
134
MICHALSKI, R., STEPP, R. E., AND DIDAY, E. 1983. Automated construction of classifications: conceptual clustering versus numerical taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. PAMI-5, 5 (Sept.), 396-409.
 
135
MISHRA, S. K. AND RAGHAVAN, V. V. 1994. An empirical study of the performance of heuristic methods for clustering. In Pattern Recognition in Practice, E. S. Gelsema and L. N. Kanal, Eds. 425-436.
 
136
 
137
MOHIUDDIN, K. M. AND MAO, g. 1994. A comparative study of different classifiers for handprinted character recognition. In Pattern Recognition in Practice, E. S. Gelsema and L. N. Kanal, Eds. 437-448.
 
138
MOOR, B.K. 1988. ART 1 and Pattern Clustering. In 1988 Connectionist Summer School, Morgan Kaufmann, San Mateo, CA, 174-185.
 
139
MURTAGH, F. 1984. A survey of recent advances in hierarchical clustering algorithms which use cluster centers. Comput. J. 26, 354-359.
 
140
MURTY, M. N. AND KRISHNA, G. 1980. A computationally efficient technique for data clustering. Pattern Recogn. 12, 153-158.
 
141
MURTY, M. N. AND JAIN, A.K. 1995. Knowledgebased clustering scheme for collection management and retrieval of library books. Pattern Recogn. 28, 949-964.
 
142
NAGY, G. 1968. State of the art in pattern recognition. Proc. IEEE 56, 836-862.
 
143
 
144
 
145
 
146
OJA, E. 1982. A simplified neuron model as a principal component analyzer. Bull. Math. Bio. 15, 267-273.
 
147
OZAWA, K. 1985. A stratificational overlapping cluster scheme. Pattern Recogn. 18, 279-286.
 
148
OPEN TEXT, 1999. http://index.opentext.net.
 
149
KAMGAR-PARSI, B., GUALTIERI, J. A., DEVANEY, J. A., AND KAMGAR-PARSI, K. 1990. Clustering with neural networks. Biol. Cybern. 63, 201-208.
 
150
LYCOS, 1999. http://www.lycos.com.
 
151
PAL, N. R., BEZDEK, J. C., AND TSAO, E. C.-K. 1993. Generalized clustering networks and Kohonen's self-organizing scheme. IEEE Trans. Neural Netw. 4, 549-557.
 
152
QUINLAN, J. R. 1990. Decision trees and decision making. IEEE Trans. Syst. Man Cybern. 20, 339-346.
153
 
154
RAGHAVAN, V. V. AND YU, C.T. 1981. A comparison of the stability characteristics of some graph theoretic clustering methods. IEEE Trans. Pattern Anal. Mach. Intell. 3, 393-402.
 
155
 
156
 
157
 
158
 
159
 
160
ROSENFELD, A., SCHNEIDER, V. B., AND HUANG, M. K. 1969. An application of cluster detection to text and picture processing. IEEE Trans. Inf. Theor. 15, 6, 672-681.
 
161
Ross, G. J. S. 1968. Classification techniques for large sets of data. In Numerical Taxonomy, A. J. Cole, Ed. Academic Press, Inc., New York, NY.
 
162
RuSPINI, E.H. 1969. A new approach to clustering. Inf. Control 15, 22-32.
 
163
SALTON, G. 1991. Developments in automatic text retrieval. Science 253, 974-980.
 
164
 
165
SAMMON, J. W. JR. 1969. A nonlinear mapping for data structure analysis. IEEE Trans. Comput. 18, 401-409.
 
166
 
167
SCHACHTER, B. J., DAVIS, L. S., AND ROSENFELD, A. 1979. Some experiments in image segmentation by clustering of local feature values. Pattern Recogn. 11, 19-28.
 
168
 
169
SELIM, S. Z. AND ISMAIL, M.A. 1984. K-meanstype algorithms: A generalized convergence theorem and characterization of local optimality. IEEE Trans. Pattern Anal. Mach. Intell. 6, 81-87.
 
170
 
171
SEN, A. AND SRIVASTAVA, M. 1990. Regression Analysis. Springer-Verlag, New York, NY.
 
172
 
173
 
174
 
175
 
176
SLAGLE, J. R., CHANG, C. L., AND HELLER, S. R. 1975. A clustering and data-reorganizing algorithm. IEEE Trans. Syst. Man Cybern. 5, 125-128.
 
177
SNEATH, P. H. A. AND SOKAL, R. R. 1973. Numerical Taxonomy. Freeman, London, UK.
 
178
SPATH, H. 1980. Cluster Analysis Algorithms for Data Reduction and Classification. Ellis Horwood, Upper Saddle River, NJ.
 
179
SOLBERG, A., TAXT, T., AND JAIN, A. 1996. A Markov random field model for classification of multisource satellite imagery. IEEE Trans. Geoscience and Remote Sensing 34, 1, 100-113.
 
180
 
181
STAHL, H. 1986. Cluster analysis of large data sets. In Classification as a Tool of Research, W. Gaul and M. Schader, Eds. Elsevier North-Holland, Inc., New York, NY, 423-430.
 
182
 
183
SUTTON, M., STARK, L., AND BOWYER, K. 1993. Function-based generic recognition for multiple object categories. In Three-Dimensional Object Recognition Systems, A. Jain and P. J. Flynn, Eds. Elsevier Science Inc., New York, NY.
 
184
SYMON, M. J. 1977. Clustering criterion and multi-variate normal mixture. Biometrics 77, 35-43.
 
185
TANAKA, E. 1995. Theoretical aspects of syntactic pattern recognition. Pattern Recogn. 28, 1053-1061.
 
186
TAXT, T. AND LUNDERVOLD, A. 1994. Multispectral analysis of the brain using magnetic resonance imaging. IEEE Trans. Medical Imaging 13, 3, 470-481.
 
187
TITTERINGTON, D. M., SMITH, A. F. M., AND MAKOV, U.E. 1985. Statistical Analysis of Finite Mixture Distributions. John Wiley and Sons, Inc., New York, NY.
 
188
TOUSSAINT, G. T. 1980. The relative neighborhood graph of a finite planar set. Pattern Recogn. 12, 261-268.
 
189
 
190
 
191
URQUHART, R.B. 1982. Graph theoretical clustering based on limited neighborhood sets. Pattern Recogn. 15, 173-187.
 
192
 
193
VINOD, V. V., CHAUDHURY, S., MUKHERJEE, J., AND GHOSE, S. 1994. A connectionist approach for clustering with applications in image analysis. IEEE Trans. Syst. Man Cybern. 24, 365-384.
 
194
WAH, B. W., Ed. 1996. Special section on mining of databases. IEEE Trans. Knowl. Data Eng. (Dec.).
 
195
WARD, J. H. JR. 1963. Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc. 58, 236-244.
 
196
 
197
WESZKA, J. 1978. A survey of threshold selection techniques. Pattern Recogn. 7, 259-265.
 
198
 
199
WILSON, D. R. AND MARTINEZ, T. R. 1997. Improved heterogeneous distance functions. J. Artif Intell. Res. 6, 1-34.
 
200
 
201
 
202
ZADEH, L.A. 1965. Fuzzy sets. Inf. Control 8, 338 -353.
 
203
ZAHN, C. T. 1971. Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Trans. Comput. C-20 (Apr.), 68-86.
 
204
ZHANG, K. 1995. Algorithms for the constrained editing distance between ordered labeled trees and related problems. Pattern Recogn. 28, 463-474.
 
205
206
 
207
ZUPAN, J. 1982. Clustering of Large Data Sets. Research Studies Press Ltd., Taunton, UK.

CITED BY  609


REVIEW

"Jose M. Ramirez : Reviewer"

Data clustering is not defined the same way in each of the disciplines that use it to deal with problems that involve the extraction of information or structure from data. The authors have produced a good survey of this slippery topic. They d  more...

Collaborative Colleagues:
A. K. Jain: colleagues
M. N. Murty: colleagues
P. J. Flynn: colleagues