|
ABSTRACT
As a prolific research area in data mining, subspace clustering and related problems induced a vast quantity of proposed solutions. However, many publications compare a new proposition—if at all—with one or two competitors, or even with a so-called “naïve” ad hoc solution, but fail to clarify the exact problem definition. As a consequence, even if two solutions are thoroughly compared experimentally, it will often remain unclear whether both solutions tackle the same problem or, if they do, whether they agree in certain tacit assumptions and how such assumptions may influence the outcome of an algorithm. In this survey, we try to clarify: (i) the different problem definitions related to subspace clustering in general; (ii) the specific difficulties encountered in this field of research; (iii) the varying assumptions, heuristics, and intuitions forming the basis of different approaches; and (iv) how several prominent solutions tackle different problems.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
Achtert, E., Böhm, C., David, J., Kröger, P., and Zimek, A. 2008. Robust clustering in arbitrarily oriented subspaces. In Proceedings of the 8th SIAM International Conference on Data Mining (SDM).
|
| |
2
|
Achtert, E., Böhm, C., Kriegel, H.-P., Kröger, P., Müller-Gorman, I., and Zimek, A. 2007. Detection and visualization of subspace cluster hierarchies. In Proceedings of the 12th International Conference on Database Systems for Advanced Applications (DASFAA).
|
 |
3
|
Elke Achtert , Christian Böhm , Hans-Peter Kriegel , Peer Kröger , Arthur Zimek, Deriving quantitative models for correlation clusters, Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, August 20-23, 2006, Philadelphia, PA, USA
[doi> 10.1145/1150402.1150408]
|
| |
4
|
|
| |
5
|
Achtert, E., Böhm, C., Kriegel, H.-P., Kröger, P., and Zimek, A. 2007b. Robust, complete, and efficient correlation clustering. In Proceedings of the 7th SIAM International Conference on Data Mining (SDM).
|
| |
6
|
|
| |
7
|
|
| |
8
|
|
 |
9
|
Charu C. Aggarwal , Joel L. Wolf , Philip S. Yu , Cecilia Procopiuc , Jong Soo Park, Fast algorithms for projected clustering, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p.61-72, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
|
 |
10
|
|
 |
11
|
Rakesh Agrawal , Johannes Gehrke , Dimitrios Gunopulos , Prabhakar Raghavan, Automatic subspace clustering of high dimensional data for data mining applications, Proceedings of the 1998 ACM SIGMOD international conference on Management of data, p.94-105, June 01-04, 1998, Seattle, Washington, United States
|
| |
12
|
Agrawal, R. and Srikant, R. 1994. Fast algorithms for mining association rules. In Proceedings of the ACM International Conference on Management of Data (SIGMOD).
|
 |
13
|
Mihael Ankerst , Markus M. Breunig , Hans-Peter Kriegel , Jörg Sander, OPTICS: ordering points to identify the clustering structure, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p.49-60, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
|
| |
14
|
|
 |
15
|
|
| |
16
|
|
 |
17
|
|
| |
18
|
Bellman, R. 1961. Adaptive Control Processes. A Guided Tour. Princeton University Press.
|
| |
19
|
|
 |
20
|
Amir Ben-Dor , Benny Chor , Richard Karp , Zohar Yakhini, Discovering local structure in gene expression data: the order-preserving submatrix problem, Proceedings of the sixth annual international conference on Computational biology, p.49-57, April 18-21, 2002, Washington, DC, USA
[doi> 10.1145/565196.565203]
|
| |
21
|
|
 |
22
|
Stefan Berchtold , Christian Böhm , Hans-Peter Kriegal, The pyramid-technique: towards breaking the curse of dimensionality, Proceedings of the 1998 ACM SIGMOD international conference on Management of data, p.142-153, June 01-04, 1998, Seattle, Washington, United States
|
| |
23
|
|
| |
24
|
|
| |
25
|
|
| |
26
|
|
| |
27
|
Bouveyron, C., Girard, S., and Schmid, C. 2007. High-dimensional data clustering. Comput. Statist. Data Anal. 52, 502--519.
|
| |
28
|
|
 |
29
|
|
| |
30
|
|
| |
31
|
Böhm, C. and Kriegel, H.-P. 2000b. Efficient construction of large high-dimensional indexes. In Proceedings of the 16th International Conference on Data Engineering (ICDE).
|
| |
32
|
|
| |
33
|
|
 |
34
|
Chun-Hung Cheng , Ada Waichee Fu , Yi Zhang, Entropy-based subspace clustering for mining numerical data, Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, p.84-93, August 15-18, 1999, San Diego, California, United States
[doi> 10.1145/312129.312199]
|
| |
35
|
Cheng, H., Hua, K. A., and Vu, K. 2008. Constrained locally weighted clustering. In Proceedings of the 34nd International Conference on Very Large Data Bases (VLDB).
|
| |
36
|
|
| |
37
|
Cho, H., Dhillon, I. S., Guan, Y., and Sra, S. 2004. Minimum sum-squared residue co-clustering of gene expression data. In Proceedings of the 4th SIAM International Conference on Data Mining (SDM).
|
 |
38
|
|
| |
39
|
Domeniconi, C., Papadopoulos, D., Gunopulos, D., and Ma, S. 2004. Subspace clustering of high dimensional data. In Proceedings of the 4th SIAM International Conference on Data Mining (SDM).
|
 |
40
|
|
| |
41
|
|
| |
42
|
Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd ACM International Conference on Knowledge Discovery and Data Mining (KDD).
|
 |
43
|
|
| |
44
|
|
 |
45
|
|
| |
46
|
Friedman, J. H. and Meulman, J. J. 2004. Clustering objects on subsets of attributes. J. Royal Statist. Soc. Series B (Statistical Methodology) 66, 4, 825--849.
|
| |
47
|
|
| |
48
|
|
| |
49
|
|
| |
50
|
Getz, G., Levine, E., and Domany, E. 2000. Coupled two-way clustering analysis of gene microarray data. Proc. National Academy Sci. United States Amer. 97, 22, 12079--12084.
|
 |
51
|
Aristides Gionis , Alexander Hinneburg , Spiros Papadimitriou , Panayiotis Tsaparas, Dimension induced clustering, Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, August 21-24, 2005, Chicago, Illinois, USA
[doi> 10.1145/1081870.1081880]
|
| |
52
|
|
| |
53
|
|
| |
54
|
|
| |
55
|
Haralick, R. and Harpaz, R. 2005. Linear manifold clustering. In Proceedings of the 4th International Conference on Machine Learning and Data Mining in Pattern Recognition (MLDM).
|
| |
56
|
|
| |
57
|
Harpaz, R. and Haralick, R. 2007a. Linear manifold correlation clustering. Int. J. Inf. Technol. Intell. Comput. 2, 2.
|
| |
58
|
Harpaz, R. and Haralick, R. 2007b. Mining subspace correlations. In Proceedings of the IEEE Symposium on Computational Intelligence and Data Mining (CIDM).
|
| |
59
|
Hartigan, J. A. 1972. Direct clustering of a data matrix. J. Amer. Statist. Assoc. 67, 337, 123--129.
|
| |
60
|
Hastie, T., Tibshirani, R., and Friedman, J. 2001. The Elements of Statistical Learning. Data Mining, Inference, and Prediction. Springer.
|
| |
61
|
|
| |
62
|
Hough, P. V. C. 1962. Methods and means for recognizing complex patterns. U.S. patent 3069654.
|
| |
63
|
|
| |
64
|
|
 |
65
|
|
| |
66
|
|
| |
67
|
|
| |
68
|
Jolliffe, I. T. 2002. Principal Component Analysis, 2nd ed. Springer.
|
| |
69
|
Kailing, K., Kriegel, H.-P., and Kröger, P. 2004. Density-Connected subspace clustering for high-dimensional data. In Proceedings of the 4th SIAM International Conference on Data Mining (SDM).
|
 |
70
|
|
| |
71
|
|
| |
72
|
|
| |
73
|
|
| |
74
|
|
| |
75
|
Kriegel, H.-P., Kröger, P., and Zimek, A. 2007. Detecting clusters in moderate-to-high-dimensional data: Subspace clustering, pattern-based clustering, and correlation clustering. Tutorial at the 7th International Conference on Data Mining (ICDM).
|
| |
76
|
Li, J., Huang, X., Selke, C., and Yong, J. 2007. A fast algorithm for finding correlation clusters in noise data. In Proceedings of the 11th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD).
|
| |
77
|
|
 |
78
|
Bing Liu , Yiyuan Xia , Philip S. Yu, Clustering through decision tree construction, Proceedings of the ninth international conference on Information and knowledge management, p.20-29, November 06-11, 2000, McLean, Virginia, United States
[doi> 10.1145/354756.354775]
|
| |
79
|
Liu, G., Li, J., Sim, K., and Wong, L. 2007. Distance based subspace clustering with flexible dimension partitioning. In Proceedings of the 23th International Conference on Data Engineering (ICDE).
|
| |
80
|
|
| |
81
|
|
| |
82
|
Mirkin, B. 1996. Mathematical Classification and Clustering. Kluwer.
|
| |
83
|
Mitchell, T. M. 1997. Mach. Learn.. McGraw-Hill.
|
 |
84
|
|
| |
85
|
|
| |
86
|
|
| |
87
|
Murali, T. M. and Kasif, S. 2003. Extracting conserved gene expression motifs from gene expression data. In Proceedings of the 8th Pacific Symposium on Biocomputing (PSB).
|
| |
88
|
Nagesh, H., Goil, S., and Choudhary, A. 2001. Adaptive grids for clustering massive data sets. In Proceedings of the 1st SIAM International Conference on Data Mining (SDM).
|
| |
89
|
|
| |
90
|
Parros Machado de Sousa, E., Traina, C., Traina, A., and Faloutsos, C. 2002. How to use fractal dimension to find correlations between attributes. In Proceedings of the KDD-Workshop on Fractals and Self-Similarity in Data Mining: Issues and Approaches.
|
 |
91
|
|
| |
92
|
|
| |
93
|
|
| |
94
|
Amela Prelić , Stefan Bleuler , Philip Zimmermann , Anja Wille , Peter Bühlmann , Wilhelm Gruissem , Lars Hennig , Lothar Thiele , Eckart Zitzler, A systematic comparison and evaluation of biclustering methods for gene expression data, Bioinformatics, v.22 n.9, p.1122-1129, May 2006
[doi> 10.1093/bioinformatics/btl060]
|
 |
95
|
|
| |
96
|
|
| |
97
|
Segal, E., Taskar, B., Gasch, A., Friedman, N., and Koller, D. 2001. Rich probabilistic models for gene expression. Bioinf. 17, 1, S243--S252.
|
| |
98
|
|
| |
99
|
Sheng, Q., Moreau, Y., and De Moor, B. 2003. Biclustering microarray data by Gibbs sampling. Bioinf. 19, 2, ii196--ii205.
|
| |
100
|
|
| |
101
|
Slagle, J. L., Chang, C. L., and Heller, S. L. 1975. A clustering and data-reorganization algorithm. IEEE Trans. Syst. Man. Cybernetics 5, 121--128.
|
| |
102
|
|
| |
103
|
Tanay, A., Sharan, R., and Shamir, R. 2002. Discovering statistically significant biclusters in gene expression data. Bioinf. 18, 1, S136--S144.
|
| |
104
|
Tanay, A., Sharan, R., and Shamir, R. 2006. Biclustering algorithms: A survey. In Handbook of Computational Molecular Biology, S. Aluru, Ed. Chapman & Hall.
|
 |
105
|
|
| |
106
|
Van Mechelen, I., Bock, H.-H., and De Boeck, P. 2004. Two-Mode clustering methods: A structured overview. Statist. Methods Med. Res. 13, 363--394.
|
 |
107
|
|
 |
108
|
|
| |
109
|
|
| |
110
|
|
| |
111
|
Woo, K.-G., Lee, J.-H., Kim, M.-H., and Lee, Y.-J. 2004. FINDIT: A fast and intelligent subspace clustering algorithm using dimension voting. Inf. Softw. Technol. 46, 4, 255--271.
|
| |
112
|
|
| |
113
|
|
| |
114
|
|
| |
115
|
|
| |
116
|
|
|