ABSTRACT
We present a comprehensive suite of experiments on learning from imbalanced data. When classes are imbalanced, many learning algorithms suffer reduced performance. Can data sampling be used to improve the performance of learners built from imbalanced data? Is the effectiveness of sampling related to the type of learner? Do the results change if the objective is to optimize different performance metrics? We address these and other issues in this work, showing that sampling improves classifier performance in many cases.