ABSTRACT
In this paper, we propose a general framework for distributed boosting intended for efficiently integrating specialized classifiers learned over very large, distributed, homogeneous databases that cannot be merged at a single location. Our distributed boosting algorithm can also be used as a parallel classification technique, where a massive database that cannot fit into main computer memory is partitioned into disjoint subsets for more efficient analysis. In the proposed method, at each boosting round the classifiers are first learned from the disjoint data sets and then exchanged amongst the sites. Finally, the classifiers are combined into a weighted voting ensemble on each disjoint data set. The ensemble that is applied to an unseen test set represents an ensemble of ensembles built on all distributed sites. In experiments performed on four large data sets, the proposed distributed boosting method achieved classification accuracy comparable to, or even slightly better than, that of the standard boosting algorithm, while requiring less memory and less computational time. In addition, the communication overhead of the distributed boosting algorithm is very small, making it a viable alternative to standard boosting for large-scale databases.
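The round structure described above (learn locally, exchange classifiers, build a weighted voting ensemble per site, then combine the site ensembles) can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the decision-stump weak learner, the unweighted majority vote over exchanged classifiers, the AdaBoost-style weight update, and all function names (`train_stump`, `composite_predict`, `distributed_boost`, `predict`) are assumptions chosen for brevity, and the "exchange" is simulated by sharing a list within one process.

```python
import math

def train_stump(X, y, w):
    """Return the (feature, threshold, sign) stump with the lowest weighted error."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for s in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if (s if row[f] >= t else -s) != yi)
                if best is None or err < best[0]:
                    best = (err, (f, t, s))
    return best[1]

def stump_predict(stump, row):
    f, t, s = stump
    return s if row[f] >= t else -s

def composite_predict(stumps, row):
    """Unweighted majority vote of the stumps exchanged in one round (assumption)."""
    return 1 if sum(stump_predict(h, row) for h in stumps) >= 0 else -1

def distributed_boost(sites, rounds):
    """sites: list of disjoint (X, y) partitions with labels in {-1, +1}.
    Returns one weighted voting ensemble per site."""
    weights = [[1.0 / len(y)] * len(y) for _, y in sites]
    ensembles = [[] for _ in sites]
    for _ in range(rounds):
        # Step 1: every site learns a classifier from its own weighted data.
        stumps = [train_stump(X, y, w) for (X, y), w in zip(sites, weights)]
        # Step 2: classifiers are "exchanged"; here each site simply sees the list.
        for k, (X, y) in enumerate(sites):
            w = weights[k]
            preds = [composite_predict(stumps, row) for row in X]
            err = sum(wi for p, yi, wi in zip(preds, y, w) if p != yi)
            err = min(max(err, 1e-10), 1 - 1e-10)  # keep the log finite
            alpha = 0.5 * math.log((1 - err) / err)
            # Step 3: the site adds the round's classifiers to its local ensemble.
            ensembles[k].append((alpha, stumps))
            # AdaBoost-style reweighting of the local examples only.
            new_w = [wi * math.exp(-alpha * yi * p)
                     for wi, yi, p in zip(w, y, preds)]
            total = sum(new_w)
            weights[k] = [wi / total for wi in new_w]
    return ensembles

def predict(ensembles, row):
    """Ensemble of ensembles: sum the weighted votes of all site ensembles."""
    score = sum(alpha * composite_predict(stumps, row)
                for ens in ensembles for alpha, stumps in ens)
    return 1 if score >= 0 else -1
```

In this sketch only the trained classifiers cross site boundaries, never the training examples, which is one way to read the abstract's claim of small communication overhead. Calling `distributed_boost(sites, rounds=3)` on two or more partitions and passing the result to `predict` classifies an unseen row.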