Exploiting Wikipedia as external knowledge for document clustering
Source
International Conference on Knowledge Discovery and Data Mining
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Paris, France
SESSION: Research track papers
Pages 389-396  
Year of Publication: 2009
ISBN: 978-1-60558-495-9
Authors
Xiaohua Hu  Drexel University, Philadelphia, PA, USA
Xiaodan Zhang  Drexel University, Philadelphia, PA, USA
Caimei Lu  Drexel University, Philadelphia, PA, USA
E. K. Park  University of Missouri at Kansas City, Kansas City, MO, USA
Xiaohua Zhou  Drexel University, Philadelphia, PA, USA
Sponsors
ACM: Association for Computing Machinery
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1557019.1557066

ABSTRACT

In traditional text clustering methods, documents are represented as "bags of words" without considering the semantic information of each document. For instance, if two documents use different collections of core words to represent the same topic, they may be falsely assigned to different clusters due to the lack of shared core words, although the core words they use are probably synonyms or semantically associated in other forms. The most common way to solve this problem is to enrich document representation with the background knowledge in an ontology. There are two major issues for this approach: (1) the coverage of the ontology is limited, even for WordNet or MeSH, and (2) using ontology terms as replacement or additional features may cause information loss or introduce noise. In this paper, we present a novel text clustering method to address these two issues by enriching document representation with Wikipedia concept and category information. We develop two approaches, exact-match and relatedness-match, to map text documents to Wikipedia concepts, and further to Wikipedia categories. The text documents are then clustered based on a similarity metric that combines document content information, concept information, and category information. The experimental results using the proposed clustering framework on three datasets (20-newsgroup, TDT2, and LA Times) show that clustering performance improves significantly by enriching document representation with Wikipedia concepts and categories.
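
The abstract does not give the form of the combined similarity metric, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a weighted linear combination of three cosine similarities (words, concepts, categories), with hypothetical weights `alpha` and `beta`, and a tiny hand-made dictionary standing in for a real Wikipedia concept and category index. Only a unigram exact-match mapping is shown; the relatedness-match approach is not covered.

```python
import math
from collections import Counter

# Hypothetical toy lookup tables (assumptions for illustration only).
# A real system would derive these from Wikipedia titles, redirects, and categories.
CONCEPTS = {"car": "Automobile", "automobile": "Automobile",
            "engine": "Engine", "motor": "Engine"}
CATEGORIES = {"Automobile": ["Vehicles"], "Engine": ["Machines"]}


def exact_match_concepts(tokens):
    """Map tokens to concepts by exact dictionary lookup (unigrams only)."""
    return Counter(CONCEPTS[t] for t in tokens if t in CONCEPTS)


def concept_categories(concepts):
    """Propagate concept counts to the categories each concept belongs to."""
    cats = Counter()
    for concept, count in concepts.items():
        for cat in CATEGORIES.get(concept, []):
            cats[cat] += count
    return cats


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def combined_similarity(doc1, doc2, alpha=0.5, beta=0.3):
    """Word similarity plus weighted concept and category similarity.

    The linear form and the default weights are assumptions of this sketch.
    """
    t1, t2 = doc1.lower().split(), doc2.lower().split()
    w1, w2 = Counter(t1), Counter(t2)
    c1, c2 = exact_match_concepts(t1), exact_match_concepts(t2)
    g1, g2 = concept_categories(c1), concept_categories(c2)
    return cosine(w1, w2) + alpha * cosine(c1, c2) + beta * cosine(g1, g2)


if __name__ == "__main__":
    # The two documents share no words, yet map to the same concepts.
    print(combined_similarity("car engine", "automobile motor"))
```

In this toy case the bag-of-words similarity is zero because the documents share no terms, while the concept and category terms still contribute, which is the failure mode of plain bag-of-words clustering that the abstract describes.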


