Topic difference factor extraction between two document sets and its application to text categorization
Source: Annual ACM Conference on Research and Development in Information Retrieval
Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Tampere, Finland
Session: Text Categorization
Pages: 137 - 144  
Year of Publication: 2002
ISBN:1-58113-561-0
Author
Takahiko Kawatani  Hewlett-Packard Labs Japan, Tokyo, Japan
Sponsor
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/564376.564402

ABSTRACT

To improve performance in text categorization, it is important to extract distinctive features for each class. This paper proposes topic difference factor analysis (TDFA) as a method to extract projection axes that reflect topic differences between two document sets. Suppose all sentence vectors that compose each document are projected onto a set of projection axes. TDFA obtains the axes that maximize the ratio, between the two document sets, of the sum of squared projections, by solving a generalized eigenvalue problem. These axes are called topic difference factors (TDFs). By applying TDFA to the set of documents that belong to a given class and the set of documents that an existing classifier misclassifies as belonging to that class, we can obtain features that take large values in the given class but small ones in other classes, as well as features that take large values in other classes but small ones in the given class. A classifier was constructed that applies these features to complement the kNN classifier. As a result, the micro-averaged F1 measure on Reuters-21578 improved from 83.69% to 87.27%.
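
The generalized eigenvalue step described in the abstract can be illustrated with a small sketch. It is not the paper's implementation: the function name `topic_difference_factors`, the `ridge` regularizer, and the toy sentence vectors are all assumptions made for illustration. The sketch reads "sum of squared projections" as w^T S w, where S is the sum of outer products of a set's sentence vectors, and finds axes w that maximize w^T S_A w / w^T S_B w.

```python
import numpy as np


def topic_difference_factors(sent_a, sent_b, n_axes=1, ridge=1e-6):
    """Axes maximizing (sum of squared projections of A) / (same for B).

    sent_a, sent_b: one sentence vector per row, shape (n_sentences, n_terms).
    ridge: assumed regularizer (not from the abstract) that keeps S_B
    positive definite so the Cholesky factorization below succeeds.
    """
    sent_a = np.asarray(sent_a, dtype=float)
    sent_b = np.asarray(sent_b, dtype=float)

    # Sum of outer products of the sentence vectors in each set.
    s_a = sent_a.T @ sent_a
    s_b = sent_b.T @ sent_b + ridge * np.eye(sent_b.shape[1])

    # Reduce S_A w = lambda S_B w to a symmetric standard problem:
    # with S_B = L L^T and w = L^{-T} y, we get (L^{-1} S_A L^{-T}) y = lambda y.
    chol = np.linalg.cholesky(s_b)
    left = np.linalg.solve(chol, s_a)          # L^{-1} S_A
    sym = np.linalg.solve(chol, left.T).T      # L^{-1} S_A L^{-T}

    eigvals, eigvecs = np.linalg.eigh(sym)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_axes]

    # Map back to the original space and scale each axis to unit length.
    axes = np.linalg.solve(chol.T, eigvecs[:, order])
    axes /= np.linalg.norm(axes, axis=0, keepdims=True)
    return eigvals[order], axes


# Toy data (hypothetical): set A is heavy on term 0, set B on term 1.
set_a = np.array([[3.0, 0.0, 1.0], [2.0, 1.0, 0.0], [4.0, 0.0, 0.0]])
set_b = np.array([[0.0, 3.0, 1.0], [1.0, 2.0, 0.0], [0.0, 4.0, 0.0]])

ratios, axes = topic_difference_factors(set_a, set_b)
print("ratio along top axis:", ratios[0])
print("top axis:", axes[:, 0])
```

Swapping the two arguments would give axes that take large values in the second set and small values in the first, which corresponds to the second kind of feature the abstract mentions.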

