Document clustering with prior knowledge
Source: Annual ACM Conference on Research and Development in Information Retrieval
Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Seattle, Washington, USA
SESSION: Clustering
Pages: 405-412
Year of Publication: 2006
ISBN: 1-59593-369-7
Authors
Xiang Ji, Yahoo! Inc., Sunnyvale, CA
Wei Xu, NEC Labs America, Inc., Cupertino, CA
Sponsors
SIGIR: ACM Special Interest Group on Information Retrieval
ACM: Association for Computing Machinery
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1148170.1148241

ABSTRACT

Document clustering is an important tool for text analysis and is used in many different applications. We propose incorporating prior knowledge of cluster membership into document cluster analysis and develop a novel semi-supervised document clustering model. The method models a set of documents as a weighted graph in which each document is represented as a vertex, and each edge connecting a pair of vertices is weighted with the similarity of the two corresponding documents. The prior knowledge indicates pairs of documents that are known to belong to the same cluster, and it is transformed into a set of constraints. The document clustering task is then accomplished by finding the best cuts of the graph under these constraints. We apply the model to the Normalized Cut method to demonstrate the idea. Our experimental evaluations show that the proposed document clustering model yields remarkable performance improvements with very limited training samples, and hence is a very effective semi-supervised classification tool.
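
The pipeline described in the abstract (similarity graph, same-cluster constraints, Normalized Cut) can be illustrated with a small sketch. This is not the authors' formulation: the abstract does not say how the constraints enter the optimization, so the quadratic penalty term, its weight `beta`, the two-cluster restriction, the function names, and the toy term-count matrix below are all assumptions made for illustration.

```python
import numpy as np

def cosine_similarity_graph(X):
    """Weighted adjacency matrix: W[i, j] is the cosine similarity of documents i and j."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.where(norms == 0, 1.0, norms)
    W = Xn @ Xn.T
    np.fill_diagonal(W, 0.0)  # no self-loops
    return W

def constrained_ncut_bipartition(W, must_link=(), beta=10.0):
    """Two-way spectral relaxation of Normalized Cut with soft same-cluster constraints.

    Assumption: each must-link pair (i, j) adds the penalty beta * (y_i - y_j)^2
    to the relaxed objective. Every vertex must have positive degree.
    """
    n = W.shape[0]
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L = np.diag(d) - W  # unnormalized graph Laplacian

    # One row per constraint: +1 at i and -1 at j, so ||U y||^2 = sum (y_i - y_j)^2.
    U = np.zeros((len(must_link), n))
    for r, (i, j) in enumerate(must_link):
        U[r, i], U[r, j] = 1.0, -1.0

    M = L + beta * (U.T @ U)
    M_sym = d_inv_sqrt[:, None] * M * d_inv_sqrt[None, :]  # D^{-1/2} M D^{-1/2}

    # eigh returns eigenvalues in ascending order; the first eigenvector is the
    # trivial one, the second gives the relaxed cluster indicator.
    _, eigvecs = np.linalg.eigh(M_sym)
    y = d_inv_sqrt * eigvecs[:, 1]
    return (y > 0).astype(int)

# Hypothetical term-count vectors: documents 0-2 share one vocabulary, 3-5 another.
docs = [
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [2, 2, 1, 0],
    [0, 0, 3, 2],
    [0, 1, 2, 3],
    [0, 0, 2, 2],
]
W = cosine_similarity_graph(docs)
labels = constrained_ncut_bipartition(W, must_link=[(0, 1)])
print(labels)  # cluster ids are arbitrary; only co-membership is meaningful
```

Because the sign of an eigenvector is arbitrary, the two cluster ids may swap between runs or platforms, so only co-membership should be compared. A larger `beta` pulls constrained pairs together more strongly; extending the sketch to more than two clusters would need several eigenvectors and a final grouping step, which is omitted here.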


