ABSTRACT
Support vector machine (SVM) learning algorithms seek the hyperplane that maximizes the margin (the distance from the separating hyperplane to the nearest examples), since this criterion provides a good upper bound on the generalization error. When applied to text classification, these algorithms yield SVMs with excellent precision but poor recall. Various relaxation approaches have been proposed to counter this problem, including asymmetric SVM learning algorithms (soft SVMs with asymmetric misclassification costs), uneven-margin-based learning, and thresholding. This paper reviews these approaches and describes a new threshold relaxation algorithm that builds on previous thresholding work based on the beta-gamma algorithm. The proposed thresholding strategy is parameter free: it relies on a process of retrofitting and cross-validation to set algorithm parameters empirically, whereas our previous approach required two parameters (beta and gamma) to be specified. The proposed approach is more efficient and, like the parameter-based approach, boosts the performance of baseline SVMs by at least 20% on standard information retrieval measures.
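The general idea of threshold relaxation can be illustrated with a minimal sketch. This is not the beta-gamma algorithm or the parameter-free procedure described in the paper, whose details the abstract does not give. It only assumes that an SVM has already produced signed decision scores for held-out examples, and it picks the decision threshold that maximizes F1 on those examples instead of using the default threshold of 0. The function names `f1_at_threshold` and `relax_threshold` and the toy scores are hypothetical.

```python
from typing import Sequence


def f1_at_threshold(scores: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """F1 of the rule 'predict positive when score >= threshold' (labels are 0 or 1)."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def relax_threshold(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Return the candidate threshold with the highest F1 on the given scores.

    Candidates are the observed scores plus the default SVM threshold 0.0.
    Ties are broken in favor of the larger threshold.
    """
    candidates = sorted(set(scores) | {0.0})
    return max(candidates, key=lambda t: (f1_at_threshold(scores, labels, t), t))


# Toy held-out scores (hypothetical): two positives fall just below 0,
# so the default threshold gives perfect precision but only 50% recall.
scores = [-1.2, -0.6, -0.4, -0.1, 0.3, 0.9]
labels = [0, 0, 1, 1, 1, 1]

default_f1 = f1_at_threshold(scores, labels, 0.0)
best_threshold = relax_threshold(scores, labels)
relaxed_f1 = f1_at_threshold(scores, labels, best_threshold)
print(default_f1, best_threshold, relaxed_f1)
```

In practice the threshold would be chosen on data separate from the test set, for example through cross-validation as the abstract mentions, so that the relaxed threshold is not fitted to the examples it is evaluated on.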
CITED BY 2
Bingjun Sun, Qingzhao Tan, Prasenjit Mitra, C. Lee Giles. Extraction and search of chemical formulae in text documents on the web. Proceedings of the 16th International Conference on World Wide Web, May 08-12, 2007, Banff, Alberta, Canada.