ABSTRACT
When class labels are known for some of the objects in a data set, a cluster analysis that respects this information, i.e. semi-supervised clustering, can give insight into both the class structure and the cluster structure of the data. Several semi-supervised clustering algorithms, such as HMRF-K-Means [4], COP-K-Means [26] and the CCL algorithm [18], have recently been proposed. Most of them extend well-known clustering methods (K-Means [22], Complete Link [17]) by enforcing two types of constraints: must-links between objects of the same class and cannot-links between objects of different classes. In this paper, we propose HISSCLU, a hierarchical, density-based method for semi-supervised clustering. Instead of deriving explicit constraints from the labeled objects, HISSCLU expands the clusters starting at all labeled objects simultaneously. During the expansion, class labels are assigned to the unlabeled objects in the way most consistent with the cluster structure. Using this information, the hierarchical cluster structure is determined. The result is visualized in a semi-supervised cluster diagram showing both the cluster structure and the class assignment. Compared to methods based on must-links and cannot-links, our method better preserves the actual cluster structure, particularly if the data set contains several distinct clusters of the same class (i.e. the intra-class data distribution is multimodal). HISSCLU produces a deterministic result, is efficient, and is robust against noise. We demonstrate the performance of our algorithm in an extensive experimental evaluation on synthetic and real-world data sets.
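To make the idea of expanding from all labeled objects simultaneously more concrete, the following is a minimal sketch, not the HISSCLU algorithm itself. It assumes plain Euclidean distance in place of the density-based distance the abstract refers to, and the function name `expand_labels` and its parameters are hypothetical. Each unlabeled point receives the label of whichever labeled region reaches it first over the shortest link.

```python
import heapq
import math


def expand_labels(points, seeds):
    """Assign a class label to every point by expanding from all seeds at once.

    points: list of coordinate tuples.
    seeds:  dict mapping point index -> class label (the labeled objects).
    Assumption: Euclidean distance stands in for a density-based distance.
    """
    labels = dict(seeds)
    heap = []  # entries are (link length, unlabeled point index, offered label)

    def push_neighbors(i):
        # Offer the label of point i to every point that is still unlabeled.
        for j in range(len(points)):
            if j not in labels:
                dist = math.dist(points[i], points[j])
                heapq.heappush(heap, (dist, j, labels[i]))

    for i in seeds:
        push_neighbors(i)

    while heap:
        _, j, label = heapq.heappop(heap)
        if j in labels:
            continue  # already reached earlier over a shorter link
        labels[j] = label
        push_neighbors(j)
    return labels


# Two well-separated groups on a line, one labeled object in each.
points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
result = expand_labels(points, {0: "A", 5: "B"})
```

In this toy example, points 0 to 2 end up in class "A" and points 3 to 5 in class "B", because each group is reached from its own labeled object before the other label can cross the gap. No must-link or cannot-link constraints are derived at any point; the labels simply follow the cluster structure.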
REFERENCES
[1] WEKA machine learning package. University of Waikato. http://www.cs.waikato.ac.nz/ml/weka
[2] Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, and Jörg Sander. OPTICS: Ordering points to identify the clustering structure. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pages 49--60, Philadelphia, Pennsylvania, United States, May 31-June 3, 1999.
[3] C. Baumgartner, C. Böhm, D. Baumgartner, G. Marini, K. Weinberger, B. Olgemöller, B. Liebl, and A. A. Roscher. Supervised machine learning techniques for the classification of metabolic disorders in newborns. Bioinformatics, 20(17):2985--2996, November 2004. doi:10.1093/bioinformatics/bth343
[4]
[5] Mikhail Bilenko, Sugato Basu, and Raymond J. Mooney. Integrating constraints and metric learning in semi-supervised clustering. In Proceedings of the Twenty-First International Conference on Machine Learning, page 11, Banff, Alberta, Canada, July 4-8, 2004. doi:10.1145/1015330.1015360
[6] C. L. Blake and C. J. Merz. UCI Repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html
[7] N. Cesa-Bianchi, C. Gentile, A. Tironi, and L. Zaniboni. Incremental algorithms for hierarchical classification. In NIPS, 2004.
[8]
[9] B.-R. Dai, C.-R. Lin, and M.-S. Chen. On the techniques for data clustering with numerical constraints. In SDM Conference, 2003.
[10] Ofer Dekel, Joseph Keshet, and Yoram Singer. Large margin hierarchical classification. In Proceedings of the Twenty-First International Conference on Machine Learning, page 27, Banff, Alberta, Canada, July 4-8, 2004. doi:10.1145/1015330.1015374
[11] B. E. Dom. An Information-Theoretic External Cluster-Validity Measure. Research Report RJ 10219, IBM, 2001.
[12] L. Dong, E. Frank, and S. Kramer. Ensembles of balanced nested dichotomies for multi-class problems. In Proc. of PKDD Conference, pages 84--95, 2005.
[13]
[14] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In KDD Conference, 1996.
[15]
[16] J. Garcia-Bustos, J. Heitman, and M. Hall. Nuclear protein localization. Biochimica et Biophysica Acta, 1071:83--101, 1991.
[17]
[18]
[19] H. Li and M. Niranjan. Outlier Detection in Benchmark Classification Tasks. In Proc. of International Conference on Acoustics, Speech and Signal Processing, pages 557--560, 2006.
[20]
[21] Z. Lu and T. Leen. Semi-supervised Learning with Penalized Probabilistic Clustering. In NIPS 17, pages 849--856, 2005.
[22] J. MacQueen. Some Methods for Classification and Analysis of Multivariate Observations. In 5th Berkeley Symp. Math. Statist. Prob., 1967.
[23] K. Nakai and M. Kanehisa. A Knowledge Base for Predicting Protein Localization Sites in Eukaryotic Cells. Genomics, 14(897):897--911, 1991.
[24]
[25] J. Sander, X. Qin, Z. Lu, N. Niu, and A. Kovarsky. Automatic Extraction of Clusters from Hierarchical Clustering Representations. In PAKDD, 2003.
[26]
[27] X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical report, 2002.
[28] A. Zimek. Hierarchical classification using ensembles of nested dichotomies. Master's thesis, TU/LMU Munich, 2005.