A general framework for accurate and fast regression by data summarization in random decision trees
Source: International Conference on Knowledge Discovery and Data Mining
Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
Philadelphia, PA, USA
SESSION: Research track papers
Pages: 136 - 146  
Year of Publication: 2006
ISBN:1-59593-339-5
Authors
Wei Fan  IBM T. J. Watson Research, Hawthorne, NY
Joe McCloskey  US Department of Defense, Ft. Meade, MD
Philip S. Yu  IBM T. J. Watson Research, Hawthorne, NY
Sponsors
ACM: Association for Computing Machinery
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1150402.1150421

ABSTRACT

Predicting the value of a continuous variable as a function of several independent variables is one of the most important problems in data mining. A very large number of regression methods, both parametric and nonparametric, have been proposed in the past. However, since the list is quite extensive, and many of these models make rather explicit, strong, yet different assumptions about the type of applicable problems and involve many parameters and options, choosing the appropriate regression methodology and then specifying the parameter values is a nontrivial, sometimes frustrating, task for data mining practitioners. Choosing an inappropriate methodology can produce rather disappointing results. This issue works against the general utility of data mining software. For example, linear regression methods are straightforward and well understood. However, since the linearity assumption is very strong, their performance is compromised on complicated non-linear problems. Kernel-based methods perform quite well if the kernel functions are selected correctly. In this paper, we propose a straightforward approach based on summarizing the training data using an ensemble of random decision trees. It requires very little knowledge from the user, yet is applicable to every type of regression problem that we are currently aware of. We have experimented on a wide range of problems, including those on which parametric methods perform well, a large selection of benchmark datasets for nonparametric regression, as well as highly non-linear stochastic problems. Our results are either significantly better than or identical to those of many approaches that are known to perform well on these problems.
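
The abstract gives no algorithmic detail, so the following is only a generic sketch of the idea of summarizing training data with an ensemble of random trees, not the authors' method. It assumes each tree picks its split feature and threshold at random without looking at the targets, stores the mean target of the training points that reach each leaf, and averages leaf values across trees at prediction time. The names build_tree, fit_ensemble, and predict, and the parameters n_trees, depth, and min_leaf, are hypothetical and chosen for illustration.

```python
import math
import random


def build_tree(X, y, depth, rng, min_leaf=3):
    """Grow one random tree; the splits never look at the targets y."""
    mean = sum(y) / len(y)
    if depth == 0 or len(y) <= min_leaf:
        return mean  # leaf: summary (mean) of the targets that landed here
    f = rng.randrange(len(X[0]))  # split feature chosen at random
    values = [row[f] for row in X]
    lo, hi = min(values), max(values)
    if lo == hi:
        return mean  # feature is constant here, nothing to split on
    t = rng.uniform(lo, hi)  # split threshold chosen at random
    left = [i for i, v in enumerate(values) if v < t]
    right = [i for i, v in enumerate(values) if v >= t]
    if not left or not right:
        return mean
    return (
        f,
        t,
        build_tree([X[i] for i in left], [y[i] for i in left],
                   depth - 1, rng, min_leaf),
        build_tree([X[i] for i in right], [y[i] for i in right],
                   depth - 1, rng, min_leaf),
    )


def fit_ensemble(X, y, n_trees=30, depth=8, seed=0):
    """Build n_trees independent random trees over the same training data."""
    rng = random.Random(seed)
    return [build_tree(X, y, depth, rng) for _ in range(n_trees)]


def predict(trees, x):
    """Average the leaf summaries that x reaches in every tree."""
    total = 0.0
    for node in trees:
        while isinstance(node, tuple):  # internal nodes are tuples, leaves are floats
            f, t, left, right = node
            node = left if x[f] < t else right
        total += node
    return total / len(trees)


# Illustrative non-linear target (an assumption, not a dataset from the paper).
X = [[i / 199] for i in range(200)]
y = [math.sin(2 * math.pi * row[0]) for row in X]
trees = fit_ensemble(X, y)
print(predict(trees, [0.25]))
```

In this sketch the only user-facing choices are the number of trees and the depth limit, which is in the spirit of the abstract's claim that little knowledge is required from the user. How the paper actually builds and summarizes its trees is described in the full text.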



