ACM Home Page
Please provide us with feedback. Feedback
BOAT—optimistic decision tree construction
Full text PdfPdf (1.70 MB)
Source International Conference on Management of Data archive
Proceedings of the 1999 ACM SIGMOD international conference on Management of data table of contents
Philadelphia, Pennsylvania, United States
Pages: 169 - 180  
Year of Publication: 1999
ISBN:1-58113-084-8
Also published in ...
Authors
Johannes Gehrke  Department of Computer Sciences and Department of Statistics, University of Wisconsin-Madison
Venkatesh Ganti  Department of Computer Sciences and Department of Statistics, University of Wisconsin-Madison
Raghu Ramakrishnan  Department of Computer Sciences and Department of Statistics, University of Wisconsin-Madison
Wei-Yin Loh  Department of Computer Sciences and Department of Statistics, University of Wisconsin-Madison
Sponsors
SIGART: ACM Special Interest Group on Artificial Intelligence
SIGMOD: ACM Special Interest Group on Management of Data
SIGACT: ACM Special Interest Group on Algorithms and Computation Theory
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 12,   Downloads (12 Months): 85,   Citation Count: 50
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/304182.304197
What is a DOI?

ABSTRACT

Classification is an important data mining problem. Given a training database of records, each tagged with a class label, the goal of classification is to build a concise model that can be used to predict the class label of future, unlabeled records. A very popular class of classifiers are decision trees. All current algorithms to construct decision trees, including all main-memory algorithms, make one scan over the training database per level of the tree. We introduce a new algorithm (BOAT) for decision tree construction that improves upon earlier algorithms in both performance and functionality. BOAT constructs several levels of the tree in only two scans over the training database, resulting in an average performance gain of 300% over previous work. The key to this performance improvement is a novel optimistic approach to tree construction in which we construct an initial tree using a small subset of the data and refine it to arrive at the final tree. We guarantee that any difference with respect to the “real” tree (i.e., the tree that would be constructed by examining all the data in a traditional way) is detected and corrected. The correction step occasionally requires us to make additional scans over subsets of the data; typically, this situation rarely arises, and can be addressed with little added cost. Beyond offering faster tree construction, BOAT is the first scalable algorithm with the ability to incrementally update the tree with respect to both insertions and deletions over the dataset. This property is valuable in dynamic environments such as data warehouses, in which the training dataset changes over time. The BOAT update operation is much cheaper than completely rebuilding the tree, and the resulting tree is guaranteed to be identical to the tree that would be produced by a complete re-build.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
AD97
A.C.Davison and D.V.Hinkley. Bootstrap Methods and their Applications. Cambridge Series in Statistical and Probabilistie Mathematics. Cambridge University Press, 1997.
 
AGI+92
 
AIS93
 
BFOS84
L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, Belmont, 1984.
 
ET93
B. Eft'on and R. J. Tibshirani. An introduction to the bootstrap. Chapman & Hall, 1993.
 
FMM96
FMMT96a
FMMT96b
 
FPSSU96
 
GRG98
J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainfor~:st - A framework for fast decision tree construction of large datasets. VLDB 1996.
 
Han97
D.J. Hand. Construction and Assessment of Classification Rules. John Wiley & Sons, Chichester, England, 1997.
 
LLS97
Tjen-Sien Lira, Wei-Y'm Loll, and Yu-Shan Shih. An empirical comparison of decision trees and other classification methods. Technical Report 979, Department of Statistics, University of Wisconsin, Madison, June 1997.
 
LS97
Wei-Y'm Loh and Yu-Shan Shih. Split selection meth(~s for classification trees. Statistica Sinica, 7(4), October 1997.
 
Man94
O.L. Mangasarian. Nonlinear Programming. Classics in Applied Mathematics. Society for industrial and Applied Mathematics, 1994.
 
MAR96
 
MFM+98
 
MS98
N. Megiddo and R. Srikant. Discovering predictive association rules. KDD 1998.
 
MST94
 
Mur95
 
Olk93
E Olken. Random Sampling from Databases. PhD lthesis, University of California at Berkeley, 1993.
 
Qui86
 
RS98
 
SAM96
 
UBC97
 
Utg89

CITED BY  50

Collaborative Colleagues:
Johannes Gehrke: colleagues
Venkatesh Ganti: colleagues
Raghu Ramakrishnan: colleagues
Wei-Yin Loh: colleagues