ABSTRACT
Clustering, as an unsupervised learning process, is a challenging problem, especially for high-dimensional datasets. The quality of clustering results can benefit from user constraints and objective validity assessment. In this article, we propose a semisupervised framework for learning the weighted Euclidean subspace in which the best clustering can be achieved. Our approach capitalizes on (i) user constraints and (ii) the quality of intermediate clustering results in terms of their structural properties. The proposed framework takes the clustering algorithm and the validity measure as parameters. We develop and discuss algorithms for learning and tuning the weights of the contributing dimensions and for defining the “best” clustering obtained by satisfying user constraints. Experimental results on benchmark datasets show that the proposed approach improves clustering accuracy.
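As a rough illustration of two ingredients named above, the following Python sketch shows a weighted Euclidean distance and a simple score for how many user constraints a clustering satisfies. It is not the article's algorithm: the function names, the must-link/cannot-link encoding of user constraints, the nearest-center assignment, and the toy data are all assumptions made for illustration, and the weight-learning step itself is not shown.

```python
from math import sqrt

def weighted_euclidean(x, y, weights):
    """Distance between x and y with one non-negative weight per dimension."""
    return sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))

def constraint_satisfaction(labels, must_link, cannot_link):
    """Fraction of pairwise constraints that a labelling satisfies."""
    total = len(must_link) + len(cannot_link)
    if total == 0:
        return 1.0
    satisfied = sum(labels[i] == labels[j] for i, j in must_link)
    satisfied += sum(labels[i] != labels[j] for i, j in cannot_link)
    return satisfied / total

def nearest_center_labels(points, centers, weights):
    """Assign each point to its closest center under the weighted distance."""
    return [
        min(range(len(centers)),
            key=lambda k: weighted_euclidean(p, centers[k], weights))
        for p in points
    ]

# Hypothetical toy data: the second dimension is noise with respect to
# the grouping that the (assumed) user constraints describe.
points = [(0.0, 0.0), (0.2, 5.0), (4.0, 0.1), (4.1, 5.2)]
centers = [(0.0, 0.0), (4.0, 5.0)]
must_link = [(0, 1), (2, 3)]   # pairs the user wants in the same cluster
cannot_link = [(0, 2)]         # pairs the user wants in different clusters

# Equal weights versus weights that ignore the noisy dimension.
score_equal = constraint_satisfaction(
    nearest_center_labels(points, centers, (1.0, 1.0)), must_link, cannot_link)
score_tuned = constraint_satisfaction(
    nearest_center_labels(points, centers, (1.0, 0.0)), must_link, cannot_link)
print(score_equal, score_tuned)
```

In this toy setting, setting the weight of the noisy dimension to zero changes which constraints are satisfied. A full framework of the kind described would search over such weights, and would also take the structural quality of each intermediate clustering into account through a validity measure.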