ACM Home Page
Please provide us with feedback. Feedback
Mining time-changing data streams
Full text PdfPdf (950 KB)
Source International Conference on Knowledge Discovery and Data Mining archive
Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining table of contents
San Francisco, California
Pages: 97 - 106  
Year of Publication: 2001
ISBN:1-58113-391-X
Authors
Geoff Hulten  University of Washington, Seattle, WA
Laurie Spencer  Innovation Next, Seattle, WA
Pedro Domingos  University of Washington, Seattle, WA
Sponsors
SIGMOD: ACM Special Interest Group on Management of Data
AAAI : American Association for Artificial Intelligence
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 46,   Downloads (12 Months): 348,   Citation Count: 101
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/502512.502529
What is a DOI?

ABSTRACT

Most statistical and machine-learning algorithms assume that the data is a random sample drawn from a stationary distribution. Unfortunately, most of the large databases available for mining today violate this assumption. They were gathered over months or years, and the underlying processes generating them changed during this time, sometimes radically. Although a number of algorithms have been proposed for learning time-changing concepts, they generally do not scale well to very large databases. In this paper we propose an efficient algorithm for mining decision trees from continuously-changing data streams, based on the ultra-fast VFDT decision tree learner. This algorithm, called CVFDT, stays current while making the most of old data by growing an alternative subtree whenever an old one becomes questionable, and replacing the old with the new when the new becomes more accurate. CVFDT learns a model which is similar in accuracy to the one that would be learned by reapplying VFDT to a moving window of examples every time a new example arrives, but with O(1) complexity per example, as opposed to O(w), where w is the size of the window. Experiments on a set of large time-changing data streams demonstrate the utility of this approach.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
R. Agrawal and G. Psaila. Active data mining. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining, pages 3-8, Montreal, Canada, 1995. AAAI Press.
2
 
3
 
4
L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984.
 
5
J. Catlett. Megainduction: Machine Learning on Very Large Databases. PhD thesis, Basser Department of Computer Science, University of Sydney, Sydney, Australia, 1991.
 
6
 
7
 
8
A. Danyluk, T. Fawcett, and F. Provost, editors. Proceedings of the AAAI-98 Workshop onPredicting the Future: AI Approaches to Time-Series Analysis. AAAI Press, Madison, WI, 1998.
9
10
 
11
12
 
13
W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13-30, 1963.
14
 
15
M. Kubat and G. Widmer, editors. Proceedings of the ICML-96 Workshop on Learning in Context-Sensitive Domains. Bari, Italy, 1996.
 
16
 
17
 
18
R. G. Miller, Jr. Simultaneous Statistical Inference. Springer, New York, NY, 2nd edition, 1981.
 
19
R. Musick, J. Catlett, and S. Russell. Decision theoretic subsampling for induction on large databases. In Proceedings of the Tenth International Conference on Machine Learning, pages 212-219, Amherst, MA, 1993. Morgan Kaufmann.
 
20
 
21
M. Salganicoff. Density-adaptive learning and forgetting. In Proceedings of the Tenth International Conference on Machine Learning, pages 276-283, Amherst, MA, 1993. Morgan Kaufmann.
 
22
 
23
J. C. Schlimmer and R. H. Granger, Jr. Beyond incremental processing: Tracking concept drift. In Proceedings of the Fifth National Conference on Artificial Intelligence, pages 502-507, Philadelphia, PA, 1986. Morgan Kaufmann.
 
24
 
25
P. Turney. Context-sensitive learning bibliography. Online bibliography, Institute for Information Technology of the National Research Council of Canada, Ottawa, Canada, 1998. http://ai.iit.nrc.ca/- bibliographies/eontext-sensitive.html.
 
26
S. Vijayakumar and S. Schaal, editors. Proceedings of the NIPS-2000 Workshop on Real-Time Modeling for Complex Learning Tasks. NIPS Foundation, Breckenridge, Colorado, 2000.
 
27
 
28
G. Widmer and M. Kubat. Special issue on context sensitivity and concept drift: Machine Learning, 32(2), 1998.
 
29
A. Wolman, G. Voelker, N. Sharma, N. Cardweli, M. Brown, T. Landray, D. Pinnel, A. Karlin, and H. Levy. Organization-based analysis of Web-object sharing and caching. In Proceedings of the Second USENIX Conference on Internet Technologies and Systems, pages 25-36, Boulder, CO, 1999.

CITED BY  101

Collaborative Colleagues:
Geoff Hulten: colleagues
Laurie Spencer: colleagues
Pedro Domingos: colleagues