|
Warning: The download time has expired please click on the item to try again.
ABSTRACT
Unsupervised clustering can be significantly improved using supervision in the form of pairwise constraints, i.e., pairs of instances labeled as belonging to same or different clusters. In recent years, a number of algorithms have been proposed for enhancing clustering quality by employing such supervision. Such methods use the constraints to either modify the objective function, or to learn the distance measure. We propose a probabilistic model for semi-supervised clustering based on Hidden Markov Random Fields (HMRFs) that provides a principled framework for incorporating supervision into prototype-based clustering. The model generalizes a previous approach that combines constraints and Euclidean distance learning, and allows the use of a broad range of clustering distortion measures, including Bregman divergences (e.g., Euclidean distance and I-divergence) and directional similarity measures (e.g., cosine similarity). We present an algorithm that performs partitional semi-supervised clustering of data by minimizing an objective function derived from the posterior energy of the HMRF model. Experimental results on several text data sets demonstrate the advantages of the proposed framework.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
 |
2
|
|
| |
3
|
A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. In Proceedings of the 2004 SIAM International Conference on Data Mining (SDM-04), 2004.
|
| |
4
|
|
| |
5
|
A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall. Learning distance functions using equivalence relations. In Proceedings of 20th International Conference on Machine Learning (ICML-03), pages 11--18, 2003.
|
| |
6
|
|
| |
7
|
S. Basu, A. Banerjee, and R. J. Mooney. Active semi-supervision for pairwise constrained clustering. In Proceedings of the 2004 SIAM International Conference on Data Mining (SDM-04), 2004.
|
| |
8
|
|
| |
9
|
J. Besag. On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society, Series B (Methodological), 48(3):259--302, 1986.
|
 |
10
|
|
 |
11
|
|
| |
12
|
|
| |
13
|
D. Cohn, R. Caruana, and A. McCallum. Semi-supervised clustering with user feedback. Technical Report TR2003-1892, Cornell University, 2003.
|
| |
14
|
|
| |
15
|
A. Demiriz, K. P. Bennett, and M. J. Embrechts. Semi-supervised clustering using genetic algorithms. In Artificial Neural Networks in Engineering (ANNIE-99), pages 809--814, 1999.
|
| |
16
|
A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39:1--38, 1977.
|
| |
17
|
|
| |
18
|
|
| |
19
|
B. E. Dom. An information-theoretic external cluster-validity measure. Research Report RJ 10219, IBM, 2001.
|
| |
20
|
M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences, USA, 95:14863--14848, 1998.
|
| |
21
|
S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721--742, 1984.
|
| |
22
|
J. M. Hammersley and P. Clifford. Markov fields on finite graphs and lattices. Unpublished manuscript, 1971.
|
| |
23
|
D. Hochbaum and D. Shmoys. A best possible heuristic for the k-center problem. Mathematics of Operations Research, 10(2):180--184, 1985.
|
| |
24
|
|
| |
25
|
S. D. Kamvar, D. Klein, and C. D. Manning. Spectral learning. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI-03), pages 561--566, 2003.
|
| |
26
|
M. Kearns, Y. Mansour, and A. Y. Ng. An information-theoretic analysis of hard and soft assignment methods for clustering. In Proceedings of 13th Conference on Uncertainty in Artificial Intelligence (UAI-97), pages 282--293, 1997.
|
| |
27
|
|
| |
28
|
|
| |
29
|
J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, pages 281--297, 1967.
|
| |
30
|
E. M. Marcotte, I. Xenarios, A. van der Bliek, and D. Eisenberg. Localizing proteins in the cell from their phylogenetic profiles. Proceedings of the National Academy of Science, 97:12115--20, 2000.
|
| |
31
|
K. V. Mardia and P. Jupp. Directional Statistics. John Wiley and Sons Ltd., 2nd edition, 2000.
|
| |
32
|
|
| |
33
|
|
| |
34
|
|
| |
35
|
|
| |
36
|
E. Segal, H. Wang, and D. Koller. Discovering molecular pathways from protein interaction and gene expression data. Bioinformatics, 19:i264--i272, July 2003.
|
| |
37
|
A. Strehl, J. Ghosh, and R. Mooney. Impact of similarity measures on web-page clustering. In AAAI 2000 Workshop on Artificial Intelligence for Web Search, pages 58--64, July 2000.
|
| |
38
|
|
| |
39
|
E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance metric learning, with application to clustering with side-information. In Advances in Neural Information Processing Systems 15, pages 505--512, Cambridge, MA, 2003. MIT Press.
|
| |
40
|
Y. Zhang, M. Brady, and S. Smith. Hidden Markov random field model and segmentation of brain MR images. IEEE Transactions on Medical Imaging, 20(1):45--57, 2001.
|
CITED BY 54
|
|
|
|
|
Arindam Banerjee , Chase Krumpelman , Joydeep Ghosh , Sugato Basu , Raymond J. Mooney, Model-based overlapping clustering, Proceeding of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, August 21-24, 2005, Chicago, Illinois, USA
|
|
|
|
|
|
|
|
|
Brian Kulis , Sugato Basu , Inderjit Dhillon , Raymond Mooney, Semi-supervised graph clustering: a kernel approach, Proceedings of the 22nd international conference on Machine learning, p.457-464, August 07-11, 2005, Bonn, Germany
|
|
|
|
|
|
|
|
|
Rong Ge , Martin Ester , Byron J. Gao , Zengjian Hu , Binay Bhattacharya , Boaz Ben-Moshe, Joint cluster analysis of attribute data and relationship data: The connected k-center problem, algorithms and applications, ACM Transactions on Knowledge Discovery from Data (TKDD), v.2 n.2, p.1-35, July 2008
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Wei Tang , Hui Xiong , Shi Zhong , Jie Wu, Enhancing semi-supervised clustering: a feature projection perspective, Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, August 12-15, 2007, San Jose, California, USA
|
|
|
|
|
|
Jason V. Davis , Brian Kulis , Prateek Jain , Suvrit Sra , Inderjit S. Dhillon, Information-theoretic metric learning, Proceedings of the 24th international conference on Machine learning, p.209-216, June 20-24, 2007, Corvalis, Oregon
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Jeffrey Erman , Anirban Mahanti , Martin Arlitt , Ira Cohen , Carey Williamson, Offline/realtime traffic classification using semi-supervised learning, Performance Evaluation, v.64 n.9-12, p.1194-1213, October, 2007
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Jeffrey Davitz , Jiye Yu , Sugato Basu , David Gutelius , Alexandra Harris, iLink: search and routing in social networks, Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, August 12-15, 2007, San Jose, California, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Jie Tang , Jing Zhang , Limin Yao , Juanzi Li , Li Zhang , Zhong Su, ArnetMiner: extraction and mining of academic social networks, Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, August 24-27, 2008, Las Vegas, Nevada, USA
|
|
|
|
|
|
|
|
|
Wenyuan Dai , Qiang Yang , Gui-Rong Xue , Yong Yu, Self-taught clustering, Proceedings of the 25th international conference on Machine learning, p.200-207, July 05-09, 2008, Helsinki, Finland
|
|
|
Yi Hong , Sam Kwong , Hui Xiong , Qingsheng Ren, Genetic-guided semi-supervised clustering algorithm with instance-level constraints, Proceedings of the 10th annual conference on Genetic and evolutionary computation, July 12-16, 2008, Atlanta, GA, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Ron Bekkerman , Hema Raghavan , James Allan , Koji Eguchi, Interactive clustering of text collections according to a user-specified criterion, Proceedings of the 20th international joint conference on Artifical intelligence, p.684-689, January 06-12, 2007, Hyderabad, India
|
|
|
|
|
|
|
|
|
|
|