ABSTRACT
Document clustering is an important tool for text analysis and is used in many applications. We propose incorporating prior knowledge of cluster membership into document cluster analysis and develop a novel semi-supervised document clustering model. The method models a set of documents as a weighted graph in which each document is represented as a vertex, and each edge connecting a pair of vertices is weighted with the similarity of the two corresponding documents. The prior knowledge identifies pairs of documents that are known to belong to the same cluster and is transformed into a set of constraints. The document clustering task is then accomplished by finding the best cuts of the graph under these constraints. We apply the model to the Normalized Cut method to demonstrate the idea. Our experimental evaluations show that the proposed document clustering model achieves remarkable performance improvements with very limited training samples, and hence is a very effective semi-supervised classification tool.
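The pipeline described above (similarity graph, same-cluster constraints, constrained graph cut) can be illustrated with a small sketch. It is not the paper's formulation: the abstract does not say how the constraints enter the optimization, so this example swaps in a simpler, well-known device and contracts each must-link pair into a single vertex before running a two-way Normalized Cut relaxation. The cosine similarity measure, the toy term-count vectors, and all function names are assumptions made for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity of two term-count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def contract_must_links(n, must_link):
    """Map each document to a group id; must-linked documents share a group."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in must_link:
        parent[find(i)] = find(j)
    roots = sorted({find(i) for i in range(n)})
    index = {r: k for k, r in enumerate(roots)}
    return [index[find(i)] for i in range(n)]


def group_affinity(vectors, group_of):
    """Sum document similarities between groups (self-loops are dropped)."""
    m = max(group_of) + 1
    w = [[0.0] * m for _ in range(m)]
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            gi, gj = group_of[i], group_of[j]
            if gi != gj:
                s = cosine(vectors[i], vectors[j])
                w[gi][gj] += s
                w[gj][gi] += s
    return w


def ncut_bipartition(w, iterations=500):
    """Two-way Normalized Cut relaxation via deflated power iteration.

    Finds the second eigenvector of N = D^-1/2 W D^-1/2 and splits by sign.
    Assumes every vertex has positive degree.
    """
    n = len(w)
    deg = [sum(row) for row in w]
    # The leading eigenvector of N is D^1/2 * 1; it carries no cut information.
    top = [math.sqrt(d) for d in deg]
    scale = math.sqrt(sum(t * t for t in top))
    top = [t / scale for t in top]
    v = [(-1) ** i * (i + 1.0) for i in range(n)]  # deterministic start vector
    for _ in range(iterations):
        proj = sum(a * b for a, b in zip(v, top))
        v = [a - proj * b for a, b in zip(v, top)]
        # Iterate with (I + N) / 2 so that all eigenvalues are non-negative.
        nv = [
            0.5 * (v[i] + sum(w[i][j] * v[j] / math.sqrt(deg[i] * deg[j])
                              for j in range(n)))
            for i in range(n)
        ]
        norm = math.sqrt(sum(x * x for x in nv))
        v = [x / norm for x in nv]
    return [1 if x > 0 else 0 for x in v]


def cluster(vectors, must_link):
    """Cluster documents into two groups while honoring must-link pairs."""
    group_of = contract_must_links(len(vectors), must_link)
    group_labels = ncut_bipartition(group_affinity(vectors, group_of))
    return [group_labels[g] for g in group_of]


# Toy term-count vectors: documents 0-2 share one vocabulary, 3-5 another.
docs = [
    [3, 1, 0, 0], [2, 2, 0, 0], [1, 3, 1, 0],
    [0, 0, 3, 1], [0, 0, 2, 2], [0, 1, 1, 3],
]
labels = cluster(docs, must_link=[(0, 2)])
print(labels)
```

Contraction enforces each constraint exactly, because the two documents become one vertex and therefore always receive the same label. Which of the two label values a cluster receives depends on the sign of the eigenvector and carries no meaning.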