ABSTRACT
Document classification is challenging because text data are sparse and high-dimensional, and because natural language has complex semantics. The traditional document representation is a word-based vector (Bag of Words, or BOW), in which each dimension corresponds to a term in the dictionary of all words that appear in the corpus. Although simple and commonly used, this representation has several limitations. To improve the predictive capability of classification algorithms, it is essential to embed semantic information and conceptual patterns in the representation. In this paper, we overcome the shortcomings of the BOW approach by embedding background knowledge derived from Wikipedia into a semantic kernel, which is then used to enrich the representation of documents. Our empirical evaluation on real data sets demonstrates that our approach achieves higher classification accuracy than the BOW technique and other recently developed methods.
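To make the contrast between a plain BOW comparison and a semantic kernel concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the method evaluated in the paper: the kernel form k(d1, d2) = (φ(d1) S)(φ(d2) S)ᵀ is the generic semantic-kernel construction, and the vocabulary, the matrix `S`, and all function names are hypothetical. In particular, the toy weights in `S` are hand-written stand-ins for whatever term-to-concept information would be derived from Wikipedia.

```python
from collections import Counter


def bow_vector(text, vocabulary):
    """Map a document to term counts over a fixed vocabulary (Bag of Words)."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


def enrich(vector, term_to_concept):
    """Project a BOW vector into concept space: phi(d) * S."""
    n_concepts = len(term_to_concept[0])
    return [
        sum(vector[i] * term_to_concept[i][j] for i in range(len(vector)))
        for j in range(n_concepts)
    ]


def semantic_kernel(v1, v2, term_to_concept):
    """Inner product of two documents after enrichment: (v1 S) . (v2 S)."""
    e1 = enrich(v1, term_to_concept)
    e2 = enrich(v2, term_to_concept)
    return sum(a * b for a, b in zip(e1, e2))


# Hypothetical three-term vocabulary.
vocabulary = ["puma", "cougar", "stock"]

# Hypothetical term-to-concept weights (assumption, not from the paper).
# Rows follow the vocabulary order; columns are two made-up concepts:
# "big cat" and "finance".
S = [
    [1.0, 0.0],  # puma
    [1.0, 0.0],  # cougar
    [0.0, 1.0],  # stock
]

d1 = bow_vector("puma puma", vocabulary)
d2 = bow_vector("cougar", vocabulary)

# Plain BOW inner product: the documents share no terms.
bow_similarity = sum(a * b for a, b in zip(d1, d2))

# Semantic kernel: both documents map onto the same concept.
semantic_similarity = semantic_kernel(d1, d2, S)
```

In this toy setup the two documents have no words in common, so their BOW inner product is zero, while the enriched representations overlap because both terms are mapped to the same concept. A kernel of this kind could be passed to a kernel-based classifier as a precomputed Gram matrix.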