|
ABSTRACT
Most statistical and machine-learning algorithms assume that the data is a random sample drawn from a stationary distribution. Unfortunately, most of the large databases available for mining today violate this assumption. They were gathered over months or years, and the underlying processes generating them changed during this time, sometimes radically. Although a number of algorithms have been proposed for learning time-changing concepts, they generally do not scale well to very large databases. In this paper we propose an efficient algorithm for mining decision trees from continuously-changing data streams, based on the ultra-fast VFDT decision tree learner. This algorithm, called CVFDT, stays current while making the most of old data by growing an alternative subtree whenever an old one becomes questionable, and replacing the old with the new when the new becomes more accurate. CVFDT learns a model which is similar in accuracy to the one that would be learned by reapplying VFDT to a moving window of examples every time a new example arrives, but with O(1) complexity per example, as opposed to O(w), where w is the size of the window. Experiments on a set of large time-changing data streams demonstrate the utility of this approach.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
R. Agrawal and G. Psaila. Active data mining. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining, pages 3-8, Montreal, Canada, 1995. AAAI Press.
|
 |
2
|
Necip Fazil Ayan , Abdullah Uz Tansel , Erol Arkun, An efficient algorithm to update large itemsets with early pruning, Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, p.287-291, August 15-18, 1999, San Diego, California, United States
[doi> 10.1145/312129.312252]
|
| |
3
|
|
| |
4
|
L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984.
|
| |
5
|
J. Catlett. Megainduction: Machine Learning on Very Large Databases. PhD thesis, Basser Department of Computer Science, University of Sydney, Sydney, Australia, 1991.
|
| |
6
|
|
| |
7
|
|
| |
8
|
A. Danyluk, T. Fawcett, and F. Provost, editors. Proceedings of the AAAI-98 Workshop onPredicting the Future: AI Approaches to Time-Series Analysis. AAAI Press, Madison, WI, 1998.
|
 |
9
|
|
 |
10
|
|
| |
11
|
|
 |
12
|
Johannes Gehrke , Venkatesh Ganti , Raghu Ramakrishnan , Wei-Yin Loh, BOAT—optimistic decision tree construction, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p.169-180, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
|
| |
13
|
W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13-30, 1963.
|
 |
14
|
Mark G. Kelly , David J. Hand , Niall M. Adams, The impact of changing populations on classifier performance, Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, p.367-371, August 15-18, 1999, San Diego, California, United States
[doi> 10.1145/312129.312285]
|
| |
15
|
M. Kubat and G. Widmer, editors. Proceedings of the ICML-96 Workshop on Learning in Context-Sensitive Domains. Bari, Italy, 1996.
|
| |
16
|
|
| |
17
|
|
| |
18
|
R. G. Miller, Jr. Simultaneous Statistical Inference. Springer, New York, NY, 2nd edition, 1981.
|
| |
19
|
R. Musick, J. Catlett, and S. Russell. Decision theoretic subsampling for induction on large databases. In Proceedings of the Tenth International Conference on Machine Learning, pages 212-219, Amherst, MA, 1993. Morgan Kaufmann.
|
| |
20
|
|
| |
21
|
M. Salganicoff. Density-adaptive learning and forgetting. In Proceedings of the Tenth International Conference on Machine Learning, pages 276-283, Amherst, MA, 1993. Morgan Kaufmann.
|
| |
22
|
|
| |
23
|
J. C. Schlimmer and R. H. Granger, Jr. Beyond incremental processing: Tracking concept drift. In Proceedings of the Fifth National Conference on Artificial Intelligence, pages 502-507, Philadelphia, PA, 1986. Morgan Kaufmann.
|
| |
24
|
|
| |
25
|
P. Turney. Context-sensitive learning bibliography. Online bibliography, Institute for Information Technology of the National Research Council of Canada, Ottawa, Canada, 1998. http://ai.iit.nrc.ca/- bibliographies/eontext-sensitive.html.
|
| |
26
|
S. Vijayakumar and S. Schaal, editors. Proceedings of the NIPS-2000 Workshop on Real-Time Modeling for Complex Learning Tasks. NIPS Foundation, Breckenridge, Colorado, 2000.
|
| |
27
|
|
| |
28
|
G. Widmer and M. Kubat. Special issue on context sensitivity and concept drift: Machine Learning, 32(2), 1998.
|
| |
29
|
A. Wolman, G. Voelker, N. Sharma, N. Cardweli, M. Brown, T. Landray, D. Pinnel, A. Karlin, and H. Levy. Organization-based analysis of Web-object sharing and caching. In Proceedings of the Second USENIX Conference on Internet Technologies and Systems, pages 25-36, Boulder, CO, 1999.
|
CITED BY 101
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Brain Babcock , Mayur Datar , Rajeev Motwani , Liadan O'Callaghan, Maintaining variance and k-medians over data stream windows, Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.234-243, June 09-11, 2003, San Diego, California
|
|
|
M. Otey , S. Parthasarathy , A. Ghoting , G. Li , S. Narravula , D. Panda, Towards NIC-based intrusion detection, Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, August 24-27, 2003, Washington, D.C.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Zhiyuan Chen , Chen Li , Jian Pei , Yufei Tao , Haixun Wang , Wei Wang , Jiong Yang , Jun Yang , Donghui Zhang, Recent progress on selected topics in database research: a report by nine young Chinese researchers working in the United States, Journal of Computer Science and Technology, v.18 n.5, p.538-552, September 2003
|
|
|
Mohamed Medhat Gaber , Shonali Krishnaswamy , Arkady Zaslavsky, Cost-efficient mining techniques for data streams, Proceedings of the second workshop on Australasian information security, Data Mining and Web Intelligence, and Software Internationalisation, p.109-114, January 01, 2004, Dunedin, New Zealand
|
|
|
Nilesh Dalvi , Pedro Domingos , Mausam , Sumit Sanghai , Deepak Verma, Adversarial classification, Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, August 22-25, 2004, Seattle, WA, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Charu C. Aggarwal , Jiawei Han , Jianyong Wang , Philip S. Yu, On demand classification of data streams, Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, August 22-25, 2004, Seattle, WA, USA
|
|
|
|
|
|
|
|
|
Brian Babcock , Shivnath Babu , Mayur Datar , Rajeev Motwani , Jennifer Widom, Models and issues in data stream systems, Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 03-05, 2002, Madison, Wisconsin
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Jiawei Han , Yixin Chen , Guozhu Dong , Jian Pei , Benjamin W. Wah , Jianyong Wang , Y. Dora Cai, Stream Cube: An Architecture for Multi-Dimensional Analysis of Data Streams, Distributed and Parallel Databases, v.18 n.2, p.173-197, September 2005
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Haixun Wang , Jian Yin , Jian Pei , Philip S. Yu , Jeffrey Xu Yu, Suppressing model overfitting in mining concept-drifting data streams, Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, August 20-23, 2006, Philadelphia, PA, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Yixin Chen , Guozhu Dong , Jiawei Han , Benjamin W. Wah , Jianyong Wang, Multi-dimensional regression analysis of time-series data streams, Proceedings of the 28th international conference on Very Large Data Bases, p.323-334, August 20-23, 2002, Hong Kong, China
|
|
|
Graham Cormode , Mayur Datar , Piotr Indyk , S. Muthukrishnan, Comparing data streams using Hamming norms (how to zero in), Proceedings of the 28th international conference on Very Large Data Bases, p.335-345, August 20-23, 2002, Hong Kong, China
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Lior Cohen , Gil Avrahami-Bakish , Mark Last , Abraham Kandel , Oscar Kipersztok, Real-time data mining of non-stationary data streams from sensor networks, Information Fusion, v.9 n.3, p.344-353, July, 2008
|
|
|
|
|
|
Xuanhui Wang , ChengXiang Zhai , Xiao Hu , Richard Sproat, Mining correlated bursty topic patterns from coordinated text streams, Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, August 12-15, 2007, San Jose, California, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Xiuyao Song , Chris Jermaine , Sanjay Ranka , John Gums, A bayesian mixture model with linear regression mixing proportions, Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, August 24-27, 2008, Las Vegas, Nevada, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Albert Bifet , Geoff Holmes , Bernhard Pfahringer , Richard Kirkby , Ricard Gavaldà, New ensemble methods for evolving data streams, Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, June 28-July 01, 2009, Paris, France
|
|
|
|
|
|
|
|