Rule-based word clustering for document metadata extraction
Source
Symposium on Applied Computing
Proceedings of the 2005 ACM symposium on Applied computing
Santa Fe, New Mexico
SESSION: Information access and retrieval (IAR)
Pages: 1049 - 1053
Year of Publication: 2005
ISBN: 1-58113-964-0
ABSTRACT
Text classification remains an important problem for unlabeled text; CiteSeer, a computer science document search engine, uses automatic text classification methods for document indexing. Text classification typically uses the words of a document's original text as the primary feature representation. However, such a representation usually suffers from high dimensionality and feature sparseness. Word clustering is an effective approach to reducing feature dimensionality and sparseness and to improving text classification performance. This paper introduces a domain rule-based word clustering method for cluster feature representation. The clusters are formed from various domain databases and from the orthographic properties of words. Besides significant dimensionality reduction, such cluster feature representations show a 6.6% absolute improvement on average in the classification performance of document header lines and an 8.4% absolute improvement in the overall accuracy of bibliographic field extraction, compared with a feature representation based only on the original text words. Our word clustering even outperforms distributional word clustering in the context of document metadata extraction.
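As a rough illustration of the idea in the abstract, the sketch below maps each word of a header line to a cluster label, using small lookup sets in place of the domain databases and a few orthographic checks. The cluster labels (`:city:`, `:cap_word:`, and so on), the lexicon contents, and the function names are all hypothetical and chosen for this example; the abstract does not specify the paper's actual rules or databases.

```python
import re

# Hypothetical stand-ins for the "domain databases" mentioned in the abstract.
DOMAIN_LEXICONS = {
    ":city:": {"houston", "washington"},
    ":month:": {"january", "march", "may"},
}

# A deliberately simple e-mail pattern (an assumption for this sketch).
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")


def cluster_of(word):
    """Return a cluster label for one word, or the lowercased word itself."""
    token = word.strip(".,;:()")
    if not token:
        return None  # the word was punctuation only
    lower = token.lower()
    if EMAIL_RE.match(token):
        return ":email:"
    # Domain rules: look the word up in each lexicon.
    for label, lexicon in DOMAIN_LEXICONS.items():
        if lower in lexicon:
            return label
    # Orthographic rules: digits, single capital letters, capitalized words.
    if token.isdigit():
        return ":digit:"
    if len(token) == 1 and token.isupper():
        return ":single_cap:"
    if token[0].isupper():
        return ":cap_word:"
    return lower


def cluster_features(line):
    """Replace the words of a line with cluster labels where a rule applies."""
    labels = (cluster_of(word) for word in line.split())
    return [label for label in labels if label is not None]


print(cluster_features("Houston, Texas May 2003"))
```

Because many distinct surface words collapse onto a handful of labels, the feature vocabulary shrinks, which is the dimensionality and sparseness reduction the abstract refers to.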