Text categorization by boosting automatically extracted concepts
Source: Annual ACM Conference on Research and Development in Information Retrieval
Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval
Toronto, Canada
SESSION: Text categorization
Pages: 182 - 189  
Year of Publication: 2003
ISBN:1-58113-646-3
Authors
Lijuan Cai  Brown University, Providence, RI
Thomas Hofmann  Brown University, Providence, RI
Sponsor
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/860435.860470

ABSTRACT

Term-based representations of documents have found widespread use in information retrieval. However, one of the main shortcomings of such methods is that they largely disregard lexical semantics and, as a consequence, are not sufficiently robust with respect to variations in word usage. In this paper we investigate the use of concept-based document representations to supplement word- or phrase-based features. The concepts are extracted automatically from documents via probabilistic latent semantic analysis. We propose to use AdaBoost to optimally combine weak hypotheses based on both types of features. Experimental results on standard benchmarks confirm the validity of our approach, showing that AdaBoost achieves consistent improvements by including additional semantic features in the learned ensemble.
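
The idea of boosting over both feature types can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract names AdaBoost but not the variant or the weak learners, so the discrete AdaBoost update and the threshold stumps below are assumptions. The concept column is a hand-picked, hypothetical stand-in for a PLSA posterior P(z|d), and the function names (`stump`, `train_adaboost`, `predict`) and the toy data are invented for illustration.

```python
import math

def stump(x, j, thr, pol):
    """Weak hypothesis: predict pol if feature j is at least thr, else -pol."""
    return pol if x[j] >= thr else -pol

def train_adaboost(X, y, rounds=3):
    """Discrete AdaBoost over threshold stumps (an assumed weak learner).

    X: feature vectors, term features followed by concept features.
    y: labels in {-1, +1}.
    Returns a list of (alpha, feature_index, threshold, polarity).
    """
    n = len(X)
    w = [1.0 / n] * n  # uniform initial example weights
    ensemble = []
    for _ in range(rounds):
        # Exhaustively pick the stump with the lowest weighted error.
        best = None
        for j in range(len(X[0])):
            for thr in sorted({x[j] for x in X}):
                for pol in (1, -1):
                    err = sum(wi for wi, x, yi in zip(w, X, y)
                              if stump(x, j, thr, pol) != yi)
                    if best is None or err < best[0]:
                        best = (err, j, thr, pol)
        err, j, thr, pol = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, j, thr, pol))
        # Raise the weight of misclassified examples, then renormalize.
        w = [wi * math.exp(-alpha * yi * stump(x, j, thr, pol))
             for wi, x, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(a * stump(x, j, thr, pol) for a, j, thr, pol in ensemble)
    return 1 if score >= 0 else -1

# Toy data (hypothetical). Columns: count("car"), count("automobile"),
# and an assumed concept weight P(z = vehicles | d).
X = [
    [2, 0, 0.9],  # on-topic, uses "car"
    [0, 1, 0.8],  # on-topic, uses "automobile"
    [0, 0, 0.7],  # on-topic, uses neither term
    [0, 0, 0.1],  # off-topic
    [1, 0, 0.2],  # off-topic, mentions "car" in passing
    [0, 0, 0.3],  # off-topic
]
y = [1, 1, 1, -1, -1, -1]

ensemble = train_adaboost(X, y)
predictions = [predict(ensemble, x) for x in X]
```

In this toy set no single term count separates the two classes, because the on-topic documents vary in word usage, while the concept column does. The booster is free to pick whichever feature type gives the lowest weighted error in each round.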

