|
ABSTRACT
Learning from imbalanced data sets, where the number of examples of one (majority) class is much higher than the others, presents an important challenge to the machine learning community. Traditional machine learning algorithms may be biased towards the majority class, thus producing poor predictive accuracy over the minority class. In this paper, we describe a new approach that combines boosting, an ensemble-based learning algorithm, with data generation to improve the predictive power of classifiers against imbalanced data sets consisting of two classes. In the DataBoost-IM method, hard examples from both the majority and minority classes are identified during execution of the boosting algorithm. Subsequently, the hard examples are used to separately generate synthetic examples for the majority and minority classes. The synthetic data are then added to the original training set, and the class distribution and the total weights of the different classes in the new training set are rebalanced. The DataBoost-IM method was evaluated, in terms of the F-measures, G-mean and overall accuracy, against seventeen highly and moderately imbalanced data sets using decision trees as base classifiers. Our results are promising and show that the DataBoost-IM method compares well in comparison with a base classifier, a standard benchmarking boosting algorithm and three advanced boosting-based algorithms for imbalanced data set. Results indicate that our approach does not sacrifice one class in favor of the other, but produces high predictions against both minority and majority classes.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
N. Japkowicz. Learning from imbalanced data sets: A comparison of various strategies, Learning from imbalanced data sets: The AAAI Workshop 10-15. Menlo Park, CA: AAAI Press. Technical Report WS-00-05, 2000.
|
| |
2
|
N. Chawla, K. Bowyer, L. Hall and W. Kegelmeyer. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321--357, 2002.
|
| |
3
|
M. A. Maloof. Learning when data sets are Imbalanced and when costs are unequal and unknown, ICML-2003 Workshop on Learning from Imbalanced Data Sets II, 2003.
|
| |
4
|
|
| |
5
|
M. Kubat and S. Matwin. Addressing the curse of imbalanced training sets: One-sided selection. Proceedings of the Fourteenth International Conference on Machine Learning San Francisco, CA, Morgan Kaufmann, 179--186, 1997.
|
| |
6
|
M. Joshi, V. Kumar and R. Agarwal. Evaluating boosting algorithms to classify rare classes: comparison and improvements. Technical Report RC-22147, IBM Research Division, 2001.
|
| |
7
|
N. Chawla, A. Lazarevic, L. Hall and K. Bowyer. SMOTEBoost: improving prediction of the minority class in boosting. 7th European Conference on Principles and Practice of Knowledge Discovery in Databases, Cavtat-Dubrovnik, Croatia, 107--119, 2003.
|
| |
8
|
Y. Freund and R. Schapire. Experiments with a new boosting algorithm. the Proceedings of the Thirteenth International Conference on Machine Learning, Bari, Italy, 148--156, 1996
|
| |
9
|
|
| |
10
|
|
| |
11
|
|
| |
12
|
C. L. Blake and C. J. Merz. UCI Repository of Machine Learning Databases {http://www.ics.uci.edu/~mlearn/MLRepository.html}. Department of Information and Computer Science, University of California, Irvine, CA, 1998.
|
| |
13
|
H. Guo and HL Viktor. Boosting with data generation: Improving the Classification of Hard to Learn Examples, to be presented at the 17th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems (IEA/AIE). Ottawa, Canada, May 17--20, 2004.
|
| |
14
|
|
| |
15
|
|
| |
16
|
F. Provost and T. Fawcett. Analysis and visualization of classifier performance: Comparison under imprecise class and cost distributions. In proceedings of the Third international conference on Knowledge discovery and data mining, Menlo park, CS. AAAI Press, 43--48, 1997.
|
| |
17
|
HL Viktor. The CILT multi-agent learning system, South African Computer Journal (SACJ), 24, 171--181, 1999.
|
| |
18
|
HL Viktor and I. Skrypnik. Improving the Competency of Ensembles of Classifiers through Data Generation, ICANNGA'2001, Prague: Czech Republic, April 21--25, 59--62, 2001.
|
| |
19
|
|
| |
20
|
|
| |
21
|
|
| |
22
|
|
| |
23
|
C. Drummond and R. Holte. C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling, Workshop on Learning from Imbalanced Data sets II held in conjunction with ICML'2003, 2003.
|
| |
24
|
H. L. Viktor and H. Guo, Multiple Classifier Prediction Improvements against Imbalanced Datasets through Added Synthetic Examples, to be presented at the10th International Workshop on Statistical Pattern Recognition, Lisbon, Portugal, August 18--20, 2004.
|
CITED BY 17
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Yuchun Tang , Yan-Qing Zhang , Nitesh V. Chawla , Sven Krasser, SVMs modeling for highly imbalanced classification, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, v.39 n.1, p.281-288, February 2009
|
|
|
Xu-Ying Liu , Jianxin Wu , Zhi-Hua Zhou, Exploratory undersampling for class-imbalance learning, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, v.39 n.2, p.539-550, April 2009
|
|