FAST: a ROC-based feature selection metric for small samples and imbalanced data classification problems
Source
International Conference on Knowledge Discovery and Data Mining
Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Las Vegas, Nevada, USA
Session: Research papers
Pages 124-132
Year of Publication: 2008
ISBN: 978-1-60558-193-4
Authors
Xue-wen Chen, The University of Kansas, Lawrence, KS, USA
Michael Wasikowski, The University of Kansas, Lawrence, KS, USA
Sponsors
ACM: Association for Computing Machinery
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1401890.1401910

ABSTRACT

The class imbalance problem arises in many practical applications of machine learning and data mining, such as information retrieval and filtering and the detection of credit card fraud. It is widely recognized that class imbalance raises issues that are either absent or less severe in balanced-class cases, and that it often results in suboptimal classifier performance. The problem is more pronounced when the imbalanced data are also high dimensional. In such cases, feature selection methods are critical to achieving optimal performance. In this paper, we propose a new feature selection method, Feature Assessment by Sliding Thresholds (FAST), which is based on the area under a ROC curve generated by moving the decision boundary of a single-feature classifier, with thresholds placed using an even-bin distribution. FAST is compared with two commonly used feature selection methods, correlation coefficient and RELevance In Estimating Features (RELIEF), for imbalanced data classification. Experimental results on text mining, mass spectrometry, and microarray data sets show that the proposed method outperformed both RELIEF and correlation methods on skewed data sets and was comparable on balanced data sets. When a small number of features is preferred, the proposed method significantly improved classification performance compared to the correlation and RELIEF-based methods.
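
The scoring idea described in the abstract can be illustrated with a short Python sketch. This is not the authors' implementation; it is one possible reading of the abstract, under several assumptions that the abstract does not confirm: "even-bin distribution" is taken to mean bins holding equal numbers of samples, each threshold is placed at the mean of its bin, the area is computed with the trapezoidal rule, and a feature is scored as max(AUC, 1 - AUC) so that its direction does not matter. The names fast_score, rank_features, and n_bins are hypothetical.

```python
import numpy as np


def fast_score(x, y, n_bins=10):
    """Approximate ROC AUC of a single-feature threshold classifier.

    Assumption: thresholds are the means of equal-count bins of the
    sorted feature values (one reading of "even-bin distribution").
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y).astype(bool)
    n_pos, n_neg = int(y.sum()), int((~y).sum())
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")

    # Split the sorted values into bins with (nearly) equal sample counts.
    chunks = np.array_split(np.sort(x), min(n_bins, x.size))
    thresholds = np.array([c.mean() for c in chunks])

    # Slide the decision boundary: predict positive when x > threshold.
    tpr = np.array([(x[y] > t).sum() / n_pos for t in thresholds])
    fpr = np.array([(x[~y] > t).sum() / n_neg for t in thresholds])

    # Add the ROC endpoints and order the points along the curve.
    fpr = np.concatenate(([0.0], fpr, [1.0]))
    tpr = np.concatenate(([0.0], tpr, [1.0]))
    order = np.lexsort((tpr, fpr))  # sort by fpr, ties broken by tpr
    fpr, tpr = fpr[order], tpr[order]

    # Trapezoidal area under the piecewise-linear ROC curve.
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

    # Assumption: a feature that separates the classes "backwards"
    # is treated as equally useful.
    return max(auc, 1.0 - auc)


def rank_features(X, y, n_bins=10):
    """Return feature indices sorted by descending score, plus the scores."""
    X = np.asarray(X, dtype=float)
    scores = np.array(
        [fast_score(X[:, j], y, n_bins) for j in range(X.shape[1])]
    )
    return np.argsort(-scores), scores
```

Under these assumptions, a feature that separates the minority class well scores close to 1, while a constant feature scores 0.5. Selecting the top-ranked indices from rank_features would then give a reduced feature set to pass to a downstream classifier.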


