Learning Ensembles from Bites: A Scalable and Accurate Approach
Source: The Journal of Machine Learning Research
Volume 5 (December 2004)
Pages: 421-451
Year of Publication: 2004
ISSN: 1532-4435

ABSTRACT

Bagging and boosting are two popular ensemble methods that typically achieve better accuracy than a single classifier. These techniques have limitations on massive data sets, because the size of the data set can be a bottleneck. Voting many classifiers built on small subsets of data ("pasting small votes") is a promising approach for learning from massive data sets, one that can utilize the power of boosting and bagging. We propose a framework for building hundreds or thousands of such classifiers on small subsets of data in a distributed environment. Experiments show this approach is fast, accurate, and scalable.
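
The idea of voting many classifiers, each built on a small subset of the data, can be illustrated with a minimal sketch. This is not the authors' implementation: the nearest-centroid base learner, the uniform random sampling of subsets, the synthetic two-class data, and all function names (`train_centroid`, `train_bites`, `vote`) are assumptions made only for illustration. The paper's actual sampling scheme and base classifier may differ.

```python
import random
from collections import Counter


def train_centroid(bite):
    """Fit a nearest-centroid classifier on one small subset ("bite").

    Hypothetical base learner, chosen only to keep the sketch short.
    """
    sums = {}
    counts = Counter()
    for features, label in bite:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] += 1
    # One mean vector per class label seen in this bite.
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict_centroid(model, features):
    """Return the label whose centroid is closest to the given point."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))


def train_bites(data, n_classifiers, bite_size, seed=0):
    """Train one classifier per small random subset of the data.

    Each classifier is independent of the others, so in a distributed
    setting each one could be trained on a different node.
    """
    rng = random.Random(seed)
    return [train_centroid(rng.sample(data, bite_size)) for _ in range(n_classifiers)]


def vote(models, features):
    """Combine the individual predictions by simple majority vote."""
    ballots = Counter(predict_centroid(m, features) for m in models)
    return ballots.most_common(1)[0][0]


# Synthetic, well-separated two-class data (an assumption for the demo).
rng = random.Random(42)
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), 0) for _ in range(1000)]
data += [((rng.gauss(5, 1), rng.gauss(5, 1)), 1) for _ in range(1000)]

models = train_bites(data, n_classifiers=25, bite_size=20)
print(vote(models, (0.0, 0.0)), vote(models, (5.0, 5.0)))
```

Each classifier here sees only 20 of the 2,000 examples, so no single training step needs the full data set in memory. Only the final vote combines the models.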


Authors:
Nitesh V. Chawla
Lawrence O. Hall
Kevin W. Bowyer
W. Philip Kegelmeyer