Wrapper-based computation and evaluation of sampling methods for imbalanced datasets
Source: International Conference on Knowledge Discovery and Data Mining
Proceedings of the 1st international workshop on Utility-based data mining
Chicago, Illinois
Pages: 24-33
Year of Publication: 2005
ISBN: 1-59593-208-9
Authors
Nitesh V. Chawla  University of Notre Dame, Notre Dame, IN
Lawrence O. Hall  University of South Florida, Tampa, FL
Ajay Joshi  University of South Florida, Tampa, FL
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1089827.1089830

ABSTRACT

Learning from imbalanced datasets presents an interesting problem from both modeling and economic standpoints. When the imbalance is large, classification accuracy on the smaller class(es) tends to be lower. In particular, when a class is of great interest but occurs relatively rarely, such as cases of fraud, instances of disease, and regions of interest in large-scale simulations, it is important to identify it accurately. It then becomes more costly to misclassify the interesting class. In this paper, we implement a wrapper approach that computes the amount of under-sampling and of synthetic generation of minority class examples (SMOTE) needed to improve minority class accuracy. The f-value serves as the evaluation function. Experimental results show that the wrapper approach is effective in optimizing the composite f-value and that it reduces the average cost per test example for the datasets considered. We report both the average cost per test example and the cost curves in the paper. The true positive rate of the minority class increases significantly without causing a significant change in the f-value. We also obtain the lowest cost per test example, compared to any result we are aware of, for the KDD Cup-99 intrusion detection data set.
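
The two evaluation quantities named in the abstract, the f-value and the average cost per test example, can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes the standard F-measure definition over minority-class precision and recall, and a simple two-class cost model in which only errors incur cost. The function names, the `beta` parameter, and the cost values are hypothetical and chosen for illustration only.

```python
def f_value(tp, fp, fn, beta=1.0):
    """F-measure for the minority (positive) class.

    Assumes the standard definition:
    (1 + beta^2) * P * R / (beta^2 * P + R).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


def average_cost(tp, fp, fn, tn, cost_fp=1.0, cost_fn=1.0):
    """Average misclassification cost per test example.

    Assumption: correct predictions cost nothing; each false positive
    costs cost_fp and each false negative costs cost_fn.
    """
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (fp * cost_fp + fn * cost_fn) / total


if __name__ == "__main__":
    # Hypothetical confusion-matrix counts, not results from the paper.
    print(f_value(tp=8, fp=2, fn=2))
    print(average_cost(tp=8, fp=2, fn=2, tn=88, cost_fp=1.0, cost_fn=5.0))
```

In a wrapper setting, a function like `f_value` would be computed on held-out data for each candidate sampling level, and the level with the best score would be retained. The search procedure itself is not specified in the abstract and is not sketched here.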



