Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach
Source: ACM SIGKDD Explorations Newsletter
Volume 6, Issue 1 (June 2004)
Special issue on learning from imbalanced datasets
Pages: 30-39
Year of Publication: 2004
ISSN: 1931-0145
Authors
Hongyu Guo  University of Ottawa, Ottawa, Ontario, Canada
Herna L. Viktor  University of Ottawa, Ottawa, Ontario, Canada
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1007730.1007736

ABSTRACT

Learning from imbalanced data sets, where the number of examples of one (majority) class is much higher than that of the others, presents an important challenge to the machine learning community. Traditional machine learning algorithms may be biased towards the majority class, thus producing poor predictive accuracy on the minority class. In this paper, we describe a new approach that combines boosting, an ensemble-based learning algorithm, with data generation to improve the predictive power of classifiers on imbalanced data sets consisting of two classes. In the DataBoost-IM method, hard examples from both the majority and minority classes are identified during execution of the boosting algorithm. Subsequently, the hard examples are used to generate synthetic examples separately for the majority and minority classes. The synthetic data are then added to the original training set, and the class distribution and the total weights of the different classes in the new training set are rebalanced. The DataBoost-IM method was evaluated, in terms of the F-measures, G-mean and overall accuracy, on seventeen highly and moderately imbalanced data sets using decision trees as base classifiers. Our results are promising and show that the DataBoost-IM method compares well with a base classifier, a standard benchmarking boosting algorithm and three advanced boosting-based algorithms for imbalanced data sets. Results indicate that our approach does not sacrifice one class in favor of the other, but achieves high predictive accuracy on both the minority and majority classes.
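
The three evaluation measures named in the abstract can be illustrated with a short sketch. This is not code from the paper: it assumes the common two-class definitions (F-measure as the harmonic mean of precision and recall for a chosen class, G-mean as the square root of the product of the two per-class recalls), and the function name `imbalance_metrics` and its parameters are hypothetical. It is meant only to show why overall accuracy alone can hide poor minority-class performance.

```python
import math

def imbalance_metrics(y_true, y_pred, positive=1):
    """Compute F-measure for the `positive` class, G-mean, and accuracy.

    Assumes a two-class problem; definitions are the commonly used ones,
    not necessarily the exact formulas used in the paper.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    recall = tp / (tp + fn) if (tp + fn) else 0.0        # accuracy on the positive class
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # accuracy on the other class
    precision = tp / (tp + fp) if (tp + fp) else 0.0

    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    g_mean = math.sqrt(recall * specificity)
    accuracy = (tp + tn) / len(pairs)
    return {"f_measure": f_measure, "g_mean": g_mean, "accuracy": accuracy}

# Hypothetical data: 10 minority examples (label 1), 90 majority examples (label 0).
y_true = [1] * 10 + [0] * 90

# A classifier that always predicts the majority class looks accurate
# (accuracy 0.9) but has F-measure 0.0 and G-mean 0.0 on the minority class.
always_majority = [0] * 100
print(imbalance_metrics(y_true, always_majority))

# A classifier that finds 8 of 10 minority examples at the cost of
# 9 false alarms trades a little accuracy for a much higher G-mean.
balanced_pred = [1] * 8 + [0] * 2 + [1] * 9 + [0] * 81
print(imbalance_metrics(y_true, balanced_pred))
```

Calling the function with `positive=0` gives the F-measure for the majority class, which is one way to report F-measures for both classes as the abstract describes.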


