ACM Home Page
Please provide us with feedback. Feedback
Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering
Full text PdfPdf (1.75 MB)
Source
ACM Transactions on Knowledge Discovery from Data (TKDD) archive
Volume 3 ,  Issue 1  (March 2009) table of contents
Article No. 1  
Year of Publication: 2009
ISSN:1556-4681
Authors
Hans-Peter Kriegel  Ludwig-Maximilians-Universität München, Munich, Germany
Peer Kröger  Ludwig-Maximilians-Universität München, Munich, Germany
Arthur Zimek  Ludwig-Maximilians-Universität München, Munich, Germany
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 299,   Downloads (12 Months): 1399,   Citation Count: 0
Additional Information:

abstract   references   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1497577.1497578
What is a DOI?

ABSTRACT

As a prolific research area in data mining, subspace clustering and related problems induced a vast quantity of proposed solutions. However, many publications compare a new proposition—if at all—with one or two competitors, or even with a so-called “naïve” ad hoc solution, but fail to clarify the exact problem definition. As a consequence, even if two solutions are thoroughly compared experimentally, it will often remain unclear whether both solutions tackle the same problem or, if they do, whether they agree in certain tacit assumptions and how such assumptions may influence the outcome of an algorithm. In this survey, we try to clarify: (i) the different problem definitions related to subspace clustering in general; (ii) the specific difficulties encountered in this field of research; (iii) the varying assumptions, heuristics, and intuitions forming the basis of different approaches; and (iv) how several prominent solutions tackle different problems.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
Achtert, E., Böhm, C., David, J., Kröger, P., and Zimek, A. 2008. Robust clustering in arbitrarily oriented subspaces. In Proceedings of the 8th SIAM International Conference on Data Mining (SDM).
 
2
Achtert, E., Böhm, C., Kriegel, H.-P., Kröger, P., Müller-Gorman, I., and Zimek, A. 2007. Detection and visualization of subspace cluster hierarchies. In Proceedings of the 12th International Conference on Database Systems for Advanced Applications (DASFAA).
3
 
4
 
5
Achtert, E., Böhm, C., Kriegel, H.-P., Kröger, P., and Zimek, A. 2007b. Robust, complete, and efficient correlation clustering. In Proceedings of the 7th SIAM International Conference on Data Mining (SDM).
 
6
 
7
 
8
9
10
11
 
12
Agrawal, R. and Srikant, R. 1994. Fast algorithms for mining association rules. In Proceedings of the ACM International Conference on Management of Data (SIGMOD).
13
 
14
15
 
16
17
 
18
Bellman, R. 1961. Adaptive Control Processes. A Guided Tour. Princeton University Press.
 
19
20
 
21
22
 
23
 
24
 
25
 
26
 
27
Bouveyron, C., Girard, S., and Schmid, C. 2007. High-dimensional data clustering. Comput. Statist. Data Anal. 52, 502--519.
 
28
29
 
30
 
31
Böhm, C. and Kriegel, H.-P. 2000b. Efficient construction of large high-dimensional indexes. In Proceedings of the 16th International Conference on Data Engineering (ICDE).
 
32
 
33
34
 
35
Cheng, H., Hua, K. A., and Vu, K. 2008. Constrained locally weighted clustering. In Proceedings of the 34nd International Conference on Very Large Data Bases (VLDB).
 
36
 
37
Cho, H., Dhillon, I. S., Guan, Y., and Sra, S. 2004. Minimum sum-squared residue co-clustering of gene expression data. In Proceedings of the 4th SIAM International Conference on Data Mining (SDM).
38
 
39
Domeniconi, C., Papadopoulos, D., Gunopulos, D., and Ma, S. 2004. Subspace clustering of high dimensional data. In Proceedings of the 4th SIAM International Conference on Data Mining (SDM).
40
 
41
 
42
Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd ACM International Conference on Knowledge Discovery and Data Mining (KDD).
43
 
44
45
 
46
Friedman, J. H. and Meulman, J. J. 2004. Clustering objects on subsets of attributes. J. Royal Statist. Soc. Series B (Statistical Methodology) 66, 4, 825--849.
 
47
 
48
 
49
 
50
Getz, G., Levine, E., and Domany, E. 2000. Coupled two-way clustering analysis of gene microarray data. Proc. National Academy Sci. United States Amer. 97, 22, 12079--12084.
51
 
52
 
53
 
54
 
55
Haralick, R. and Harpaz, R. 2005. Linear manifold clustering. In Proceedings of the 4th International Conference on Machine Learning and Data Mining in Pattern Recognition (MLDM).
 
56
 
57
Harpaz, R. and Haralick, R. 2007a. Linear manifold correlation clustering. Int. J. Inf. Technol. Intell. Comput. 2, 2.
 
58
Harpaz, R. and Haralick, R. 2007b. Mining subspace correlations. In Proceedings of the IEEE Symposium on Computational Intelligence and Data Mining (CIDM).
 
59
Hartigan, J. A. 1972. Direct clustering of a data matrix. J. Amer. Statist. Assoc. 67, 337, 123--129.
 
60
Hastie, T., Tibshirani, R., and Friedman, J. 2001. The Elements of Statistical Learning. Data Mining, Inference, and Prediction. Springer.
 
61
 
62
Hough, P. V. C. 1962. Methods and means for recognizing complex patterns. U.S. patent 3069654.
 
63
 
64
65
 
66
 
67
 
68
Jolliffe, I. T. 2002. Principal Component Analysis, 2nd ed. Springer.
 
69
Kailing, K., Kriegel, H.-P., and Kröger, P. 2004. Density-Connected subspace clustering for high-dimensional data. In Proceedings of the 4th SIAM International Conference on Data Mining (SDM).
70
 
71
 
72
 
73
 
74
 
75
Kriegel, H.-P., Kröger, P., and Zimek, A. 2007. Detecting clusters in moderate-to-high-dimensional data: Subspace clustering, pattern-based clustering, and correlation clustering. Tutorial at the 7th International Conference on Data Mining (ICDM).
 
76
Li, J., Huang, X., Selke, C., and Yong, J. 2007. A fast algorithm for finding correlation clusters in noise data. In Proceedings of the 11th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD).
 
77
78
 
79
Liu, G., Li, J., Sim, K., and Wong, L. 2007. Distance based subspace clustering with flexible dimension partitioning. In Proceedings of the 23th International Conference on Data Engineering (ICDE).
 
80
 
81
 
82
Mirkin, B. 1996. Mathematical Classification and Clustering. Kluwer.
 
83
Mitchell, T. M. 1997. Mach. Learn.. McGraw-Hill.
84
 
85
 
86
 
87
Murali, T. M. and Kasif, S. 2003. Extracting conserved gene expression motifs from gene expression data. In Proceedings of the 8th Pacific Symposium on Biocomputing (PSB).
 
88
Nagesh, H., Goil, S., and Choudhary, A. 2001. Adaptive grids for clustering massive data sets. In Proceedings of the 1st SIAM International Conference on Data Mining (SDM).
 
89
 
90
Parros Machado de Sousa, E., Traina, C., Traina, A., and Faloutsos, C. 2002. How to use fractal dimension to find correlations between attributes. In Proceedings of the KDD-Workshop on Fractals and Self-Similarity in Data Mining: Issues and Approaches.
91
 
92
 
93
 
94
95
 
96
 
97
Segal, E., Taskar, B., Gasch, A., Friedman, N., and Koller, D. 2001. Rich probabilistic models for gene expression. Bioinf. 17, 1, S243--S252.
 
98
 
99
Sheng, Q., Moreau, Y., and De Moor, B. 2003. Biclustering microarray data by Gibbs sampling. Bioinf. 19, 2, ii196--ii205.
 
100
 
101
Slagle, J. L., Chang, C. L., and Heller, S. L. 1975. A clustering and data-reorganization algorithm. IEEE Trans. Syst. Man. Cybernetics 5, 121--128.
 
102
 
103
Tanay, A., Sharan, R., and Shamir, R. 2002. Discovering statistically significant biclusters in gene expression data. Bioinf. 18, 1, S136--S144.
 
104
Tanay, A., Sharan, R., and Shamir, R. 2006. Biclustering algorithms: A survey. In Handbook of Computational Molecular Biology, S. Aluru, Ed. Chapman & Hall.
105
 
106
Van Mechelen, I., Bock, H.-H., and De Boeck, P. 2004. Two-Mode clustering methods: A structured overview. Statist. Methods Med. Res. 13, 363--394.
107
108
 
109
 
110
 
111
Woo, K.-G., Lee, J.-H., Kim, M.-H., and Lee, Y.-J. 2004. FINDIT: A fast and intelligent subspace clustering algorithm using dimension voting. Inf. Softw. Technol. 46, 4, 255--271.
 
112
 
113
 
114
 
115
 
116

Collaborative Colleagues:
Hans-Peter Kriegel: colleagues
Peer Kröger: colleagues
Arthur Zimek: colleagues