OCFS: optimal orthogonal centroid feature selection for text categorization
Source Annual ACM Conference on Research and Development in Information Retrieval
Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Salvador, Brazil
SESSION: Categorization and classification
Pages: 122 - 129
Year of Publication: 2005
ISBN: 1-59593-034-5
Authors
Jun Yan  Peking University, Beijing, P. R. China
Ning Liu  Tsinghua University, Beijing, P. R. China
Benyu Zhang  Microsoft Research Asia, Beijing, P. R. China
Shuicheng Yan  Chinese University of Hong Kong, Hong Kong
Zheng Chen  Microsoft Research Asia, Beijing, P. R. China
Qiansheng Cheng  Peking University, Beijing, P. R. China
Weiguo Fan  Virginia Polytechnic Institute and State University, Blacksburg, VA
Wei-Ying Ma  Microsoft Research Asia, Beijing, P. R. China
Sponsor
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 20,   Downloads (12 Months): 91,   Citation Count: 9
DOI: http://doi.acm.org/10.1145/1076034.1076058

ABSTRACT

Text categorization is an important research area in many Information Retrieval (IR) applications. To save storage space and computation time in text categorization, efficient and effective algorithms for reducing the data before analysis are highly desired. Traditional techniques for this purpose can generally be classified into feature extraction and feature selection. Because of its efficiency, the latter is more suitable for text data such as web documents. However, many popular feature selection techniques, such as Information Gain (IG) and the χ²-test (CHI), are greedy in nature and thus may not be optimal according to some criterion. Moreover, the performance of these greedy methods may deteriorate when the reserved data dimension is extremely low. In this paper, we propose an efficient optimal feature selection algorithm, called Orthogonal Centroid Feature Selection (OCFS), which optimizes the objective function of the Orthogonal Centroid (OC) subspace learning algorithm in a discrete solution space. Experiments on 20 Newsgroups (20NG), Reuters Corpus Volume 1 (RCV1) and Open Directory Project (ODP) data show that OCFS is consistently better than IG and CHI with smaller computation time, especially when the reduced dimension is extremely small.
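
The abstract does not spell out the scoring rule, so the following is only a minimal sketch of what centroid-based feature selection could look like. It assumes the OC objective restricted to a set of features reduces to a per-feature score: the class-size-weighted squared difference between each class centroid and the global centroid along that feature, with the k highest-scoring features retained. The names `ocfs_scores` and `select_features` and the toy data are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def ocfs_scores(X, y):
    """Score each feature (assumed rule, not quoted from the paper):
    sum over classes of (n_j / n) * (class centroid - global centroid)^2,
    computed independently along each feature."""
    n = len(X)
    d = len(X[0])
    # Global centroid: mean of every feature over all documents.
    global_centroid = [sum(row[i] for row in X) / n for i in range(d)]

    # Group document vectors by class label.
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)

    scores = [0.0] * d
    for rows in by_class.values():
        n_j = len(rows)
        for i in range(d):
            centroid_i = sum(r[i] for r in rows) / n_j
            scores[i] += (n_j / n) * (centroid_i - global_centroid[i]) ** 2
    return scores


def select_features(X, y, k):
    """Return the indices of the k highest-scoring features, in index order."""
    scores = ocfs_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Toy term-frequency vectors: features 0 and 1 separate the two classes,
    # feature 2 has the same mean in both classes.
    X = [[1, 0, 5], [1, 0, 4], [0, 1, 5], [0, 1, 4]]
    y = ["a", "a", "b", "b"]
    print(ocfs_scores(X, y))
    print(select_features(X, y, 2))
```

Under this assumption each feature is scored in a single pass over the data and the selection is a sort, which would be consistent with the low computation time the abstract reports, although the abstract itself does not state the complexity.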

