ABSTRACT
Predicting the value of a continuous variable as a function of several independent variables is one of the most important problems in data mining. A very large number of regression methods, both parametric and nonparametric, have been proposed in the past. However, the list is quite extensive, and many of these models make explicit, strong, yet different assumptions about the type of problem they apply to and involve many parameters and options. As a result, choosing the appropriate regression methodology and then specifying the parameter values is a non-trivial, sometimes frustrating, task for data mining practitioners. Choosing an inappropriate methodology can lead to rather disappointing results, which works against the general utility of data mining software. For example, linear regression methods are straightforward and well understood. However, since the linearity assumption is very strong, their performance is compromised on complicated non-linear problems. Kernel-based methods perform quite well, but only if the kernel functions are selected correctly. In this paper, we propose a straightforward approach based on summarizing the training data using an ensemble of random decision trees. It requires very little knowledge from the user, yet is applicable to every type of regression problem that we are currently aware of. We have experimented on a wide range of problems, including those on which parametric methods perform well, a large selection of benchmark datasets for nonparametric regression, and highly non-linear stochastic problems. Our results are either significantly better than or identical to those of many approaches that are known to perform well on these problems.
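The abstract does not spell out how the trees are built, so the following is only a minimal sketch of the general idea, not the paper's algorithm. It assumes that each tree picks its split feature and threshold at random without looking at the target, that each leaf summarizes the training targets that fall into it by their mean, and that the ensemble averages the per-tree predictions. The names `build_tree`, `fit_ensemble`, and `predict`, and the defaults `n_trees`, `depth`, and `min_leaf`, are hypothetical choices made for illustration.

```python
import math
import random
from statistics import mean


def build_tree(X, y, depth, rng, min_leaf=3):
    """Grow one tree; splits are random and never inspect y (an assumption)."""
    if depth == 0 or len(y) <= min_leaf:
        return mean(y)  # leaf: summary of the training targets that landed here
    feature = rng.randrange(len(X[0]))          # random split feature
    values = [row[feature] for row in X]
    lo, hi = min(values), max(values)
    if lo == hi:
        return mean(y)                          # nothing left to split on
    threshold = rng.uniform(lo, hi)             # random split point
    left = [i for i, v in enumerate(values) if v < threshold]
    right = [i for i, v in enumerate(values) if v >= threshold]
    if not left or not right:
        return mean(y)
    return (
        feature,
        threshold,
        build_tree([X[i] for i in left], [y[i] for i in left], depth - 1, rng, min_leaf),
        build_tree([X[i] for i in right], [y[i] for i in right], depth - 1, rng, min_leaf),
    )


def predict_tree(node, x):
    """Walk from the root to a leaf and return the stored mean."""
    while isinstance(node, tuple):
        feature, threshold, left, right = node
        node = left if x[feature] < threshold else right
    return node


def fit_ensemble(X, y, n_trees=30, depth=8, seed=0):
    """Build n_trees independent random trees from one seeded generator."""
    rng = random.Random(seed)
    return [build_tree(X, y, depth, rng) for _ in range(n_trees)]


def predict(trees, x):
    """Average the per-tree predictions."""
    return mean([predict_tree(tree, x) for tree in trees])


# Toy non-linear problem: y = sin(x) sampled on [0, 2*pi].
X = [[2 * math.pi * i / 199] for i in range(200)]
y = [math.sin(row[0]) for row in X]
trees = fit_ensemble(X, y)
print(predict(trees, [math.pi / 2]))
```

In this sketch the only user-facing choices are the number of trees and the depth limit, which is in the spirit of the abstract's claim that little knowledge is required from the user. Because every prediction is an average of leaf means, it always lies within the range of the training targets.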
CITED BY
Kun-Lung Wu, Kirsten W. Hildrum, Wei Fan, Philip S. Yu, Charu C. Aggarwal, David A. George, Buǧra Gedik, Eric Bouillet, Xiaohui Gu, Gang Luo, Haixun Wang. Challenges and experience in prototyping a multi-modal stream analytic and monitoring application on System S. Proceedings of the 33rd International Conference on Very Large Data Bases, September 23-27, 2007, Vienna, Austria.