OCFS: optimal orthogonal centroid feature selection for text categorization
Source
Annual ACM Conference on Research and Development in Information Retrieval
Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Salvador, Brazil
SESSION: Categorization and classification
Pages: 122 - 129
Year of Publication: 2005
ISBN: 1-59593-034-5

Authors
Jun Yan, Peking University, Beijing, P. R. China
Ning Liu, Tsinghua University, Beijing, P. R. China
Benyu Zhang, Microsoft Research Asia, Beijing, P. R. China
Shuicheng Yan, Chinese University of Hong Kong, Hong Kong
Zheng Chen, Microsoft Research Asia, Beijing, P. R. China
Qiansheng Cheng, Peking University, Beijing, P. R. China
Weiguo Fan, Virginia Polytechnic Institute and State University, Blacksburg, VA
Wei-Ying Ma, Microsoft Research Asia, Beijing, P. R. China

ABSTRACT
Text categorization is an important research area in many Information Retrieval (IR) applications. To save storage space and computation time in text categorization, efficient and effective algorithms for reducing the data before analysis are highly desired. Traditional techniques for this purpose can generally be classified into feature extraction and feature selection. Because it is more efficient, the latter is better suited to text data such as web documents. However, many popular feature selection techniques, such as Information Gain (IG) and the χ²-test (CHI), are greedy in nature and thus may not be optimal according to some criterion. Moreover, the performance of these greedy methods may deteriorate when the reserved data dimension is extremely low. In this paper, we propose an efficient optimal feature selection algorithm, called Orthogonal Centroid Feature Selection (OCFS), which optimizes the objective function of the Orthogonal Centroid (OC) subspace learning algorithm in a discrete solution space. Experiments on 20 Newsgroups (20NG), Reuters Corpus Volume 1 (RCV1) and Open Directory Project (ODP) data show that OCFS is consistently better than IG and CHI with smaller computation time, especially when the reduced dimension is extremely small.
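The abstract does not state the scoring rule, so the following is only a rough sketch of how a centroid-based feature selector of this kind might look. It assumes each feature is scored by the class-size-weighted squared distance between each class centroid and the global centroid along that feature, and that the top-k features are kept. The function names `ocfs_scores` and `select_features` and the toy data are hypothetical and not taken from the paper.

```python
import numpy as np

def ocfs_scores(X, y):
    """Assumed score per feature: sum over classes of
    (n_j / n) * (class centroid - global centroid)^2 along that feature."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = X.shape[0]
    global_centroid = X.mean(axis=0)
    scores = np.zeros(X.shape[1])
    for label in np.unique(y):
        rows = X[y == label]                      # documents of this class
        diff = rows.mean(axis=0) - global_centroid
        scores += (rows.shape[0] / n) * diff ** 2
    return scores

def select_features(X, y, k):
    """Return the indices of the k highest-scoring features, best first."""
    scores = ocfs_scores(X, y)
    return np.argsort(-scores, kind="stable")[:k]

# Toy term-count matrix (made up): 4 documents, 3 terms, 2 classes.
X = np.array([[3, 0, 1],
              [2, 0, 1],
              [0, 4, 1],
              [0, 3, 1]])
y = np.array([0, 0, 1, 1])
top = select_features(X, y, k=2)
```

In this toy example the third term has the same count in every document, so its score is zero and it is not selected. Each feature is scored in a single pass over the class centroids, with no iterative search over feature subsets, which is one plausible reading of the efficiency claim in the abstract.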
CITED BY 9
Tieyun Qian, Hui Xiong, Yuanzhen Wang, Enhong Chen, Adapting association patterns for text categorization: weaknesses and enhancements, Proceedings of the 15th ACM international conference on Information and knowledge management, November 06-11, 2006, Arlington, Virginia, USA
Bin Cao, Dou Shen, Jian-Tao Sun, Qiang Yang, Zheng Chen, Feature selection in a kernel space, Proceedings of the 24th international conference on Machine learning, p.121-128, June 20-24, 2007, Corvallis, Oregon
Jun Yan, Benyu Zhang, Ning Liu, Shuicheng Yan, Qiansheng Cheng, Weiguo Fan, Qiang Yang, Wensi Xi, Zheng Chen, Effective and Efficient Dimensionality Reduction for Large-Scale and Streaming Data Preprocessing, IEEE Transactions on Knowledge and Data Engineering, v.18 n.3, p.320-333, March 2006