|
ABSTRACT
Classification is an important data mining problem. Given a training database of records, each tagged with a class label, the goal of classification is to build a concise model that can be used to predict the class label of future, unlabeled records. A very popular class of classifiers are decision trees. All current algorithms to construct decision trees, including all main-memory algorithms, make one scan over the training database per level of the tree.
We introduce a new algorithm (BOAT) for decision tree construction that improves upon earlier algorithms in both performance and functionality. BOAT constructs several levels of the tree in only two scans over the training database, resulting in an average performance gain of 300% over previous work. The key to this performance improvement is a novel optimistic approach to tree construction in which we construct an initial tree using a small subset of the data and refine it to arrive at the final tree. We guarantee that any difference with respect to the “real” tree (i.e., the tree that would be constructed by examining all the data in a traditional way) is detected and corrected. The correction step occasionally requires us to make additional scans over subsets of the data; typically, this situation rarely arises, and can be addressed with little added cost.
Beyond offering faster tree construction, BOAT is the first scalable algorithm with the ability to incrementally update the tree with respect to both insertions and deletions over the dataset. This property is valuable in dynamic environments such as data warehouses, in which the training dataset changes over time. The BOAT update operation is much cheaper than completely rebuilding the tree, and the resulting tree is guaranteed to be identical to the tree that would be produced by a complete re-build.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
AD97
|
A.C.Davison and D.V.Hinkley. Bootstrap Methods and their Applications. Cambridge Series in Statistical and Probabilistie Mathematics. Cambridge University Press, 1997.
|
| |
AGI+92
|
|
| |
AIS93
|
|
| |
BFOS84
|
L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, Belmont, 1984.
|
| |
ET93
|
B. Eft'on and R. J. Tibshirani. An introduction to the bootstrap. Chapman & Hall, 1993.
|
| |
FMM96
|
|
 |
FMMT96a
|
Takeshi Fukuda , Yasukiko Morimoto , Shinichi Morishita , Takeshi Tokuyama, Data mining using two-dimensional optimized association rules: scheme, algorithms, and visualization, Proceedings of the 1996 ACM SIGMOD international conference on Management of data, p.13-23, June 04-06, 1996, Montreal, Quebec, Canada
|
 |
FMMT96b
|
Takeshi Fukuda , Yasuhido Morimoto , Shinichi Morishita , Takeshi Tokuyama, Mining optimized association rules for numeric attributes, Proceedings of the fifteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, p.182-191, June 04-06, 1996, Montreal, Quebec, Canada
[doi> 10.1145/237661.237708]
|
| |
FPSSU96
|
|
| |
GRG98
|
J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainfor~:st - A framework for fast decision tree construction of large datasets. VLDB 1996.
|
| |
Han97
|
D.J. Hand. Construction and Assessment of Classification Rules. John Wiley & Sons, Chichester, England, 1997.
|
| |
LLS97
|
Tjen-Sien Lira, Wei-Y'm Loll, and Yu-Shan Shih. An empirical comparison of decision trees and other classification methods. Technical Report 979, Department of Statistics, University of Wisconsin, Madison, June 1997.
|
| |
LS97
|
Wei-Y'm Loh and Yu-Shan Shih. Split selection meth(~s for classification trees. Statistica Sinica, 7(4), October 1997.
|
| |
Man94
|
O.L. Mangasarian. Nonlinear Programming. Classics in Applied Mathematics. Society for industrial and Applied Mathematics, 1994.
|
| |
MAR96
|
|
| |
MFM+98
|
|
| |
MS98
|
N. Megiddo and R. Srikant. Discovering predictive association rules. KDD 1998.
|
| |
MST94
|
|
| |
Mur95
|
|
| |
Olk93
|
E Olken. Random Sampling from Databases. PhD lthesis, University of California at Berkeley, 1993.
|
| |
Qui86
|
|
| |
RS98
|
|
| |
SAM96
|
|
| |
UBC97
|
|
| |
Utg89
|
|
CITED BY 50
|
|
Yiming Ma , Bing Liu , Ching Kian Wong , Philip S. Yu , Shuik Ming Lee, Targeting the right students using data mining, Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, p.457-464, August 20-23, 2000, Boston, Massachusetts, United States
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Johannes Gehrke , Wie-Yin Loh , Raghu Ramakrishnan, Classification and regression: money *can* grow on trees, Tutorial notes of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, p.1-73, August 15-18, 1999, San Diego, California, United States
|
|
|
|
|
|
|
|
|
|
|
|
Bing Liu , Yiyuan Xia , Philip S. Yu, Clustering through decision tree construction, Proceedings of the ninth international conference on Information and knowledge management, p.20-29, November 06-11, 2000, McLean, Virginia, United States
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Minos N. Garofalakis , Rajeev Rastogi , S. Seshadri , Kyuseok Shim, Data mining and the Web: past, present and future, Proceedings of the 2nd international workshop on Web information and data management, p.43-47, November 02-06, 1999, Kansas City, Missouri, United States
|
|
|
Minos Garofalakis , Dongjoon Hyun , Rajeev Rastogi , Kyuseok Shim, Efficient algorithms for constructing decision trees with constraints, Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, p.335-339, August 20-23, 2000, Boston, Massachusetts, United States
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Haixun Wang , Jian Yin , Jian Pei , Philip S. Yu , Jeffrey Xu Yu, Suppressing model overfitting in mining concept-drifting data streams, Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, August 20-23, 2006, Philadelphia, PA, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Wei Fan , Haixun Wang , Philip S. Yu , Shaw-Hwa Lo, Inductive learning in less than one sequential data scan, Proceedings of the 18th international joint conference on Artificial intelligence, p.595-600, August 09-15, 2003, Acapulco, Mexico
|
|