Systematic data selection to mine concept-drifting data streams
Source: International Conference on Knowledge Discovery and Data Mining
Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Seattle, WA, USA
Session: Research track papers
Pages: 128-137
Year of Publication: 2004
ISBN: 1-58113-888-1
Author
Wei Fan, IBM T.J. Watson Research, Hawthorne, NY
Sponsors
SIGMOD: ACM Special Interest Group on Management of Data
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
ACM: Association for Computing Machinery
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1014052.1014069

ABSTRACT

One major problem with existing methods for mining data streams is that they make ad hoc choices about how to combine the most recent data with some amount of old data when searching for a new hypothesis. The assumption is that the additional old data always helps produce a more accurate hypothesis than the most recent data alone. We first criticize this notion and point out that using old data blindly is no better than "gambling"; in other words, it increases accuracy only if we are "lucky." We discuss and analyze the situations in which old data will help and what kind of old data will help. The practical difficulty in choosing the right examples from old data lies in the formidable cost of comparing different possibilities and models. This problem goes away if we have an algorithm that is efficient enough to compare all sensible choices at little extra cost. Based on this observation, we propose a simple, efficient and accurate cross-validation decision tree ensemble method.
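
The central idea of the abstract, comparing candidate choices of training data instead of assuming that old data helps, can be illustrated with a minimal sketch. This is not the paper's algorithm: the one-feature threshold classifier, the two candidates ("recent_only" and "recent_plus_old"), the fold count, and every function name below are assumptions made for illustration. Each candidate is scored by cross-validation on the most recent data only, since that data reflects the current concept.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Example = Tuple[float, int]  # (feature value, binary label); hypothetical data layout


def train_stump(data: Sequence[Example]) -> Callable[[float], int]:
    """Fit a one-feature threshold classifier by exhaustive search over data values."""
    best = (-1, 0.0, 1)  # (training examples correct, threshold, label predicted at or above it)
    for threshold, _ in data:
        for above in (0, 1):
            correct = sum((above if x >= threshold else 1 - above) == y for x, y in data)
            if correct > best[0]:
                best = (correct, threshold, above)
    _, threshold, above = best
    return lambda x: above if x >= threshold else 1 - above


def accuracy(model: Callable[[float], int], data: Sequence[Example]) -> float:
    return sum(model(x) == y for x, y in data) / len(data)


def cv_score(recent: Sequence[Example], extra: Sequence[Example], k: int = 3) -> float:
    """Mean held-out accuracy on `recent`; each model trains on the other recent folds plus `extra`."""
    folds = [recent[i::k] for i in range(k)]
    scores: List[float] = []
    for i, held_out in enumerate(folds):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold] + list(extra)
        scores.append(accuracy(train_stump(train), held_out))
    return sum(scores) / k


def select_training_data(recent: Sequence[Example], old: Sequence[Example],
                         k: int = 3) -> Tuple[str, Dict[str, float]]:
    """Return the best-scoring candidate name and the score of every candidate."""
    candidates = {"recent_only": [], "recent_plus_old": list(old)}
    scores = {name: cv_score(recent, extra, k) for name, extra in candidates.items()}
    return max(scores, key=scores.get), scores


# Recent data: label 1 when x < 0.5. The old data follows the opposite concept (drift).
recent = [(i / 10, 1 if i / 10 < 0.5 else 0) for i in range(10)]
old_drifted = [(i / 20, 1 if i / 20 >= 0.5 else 0) for i in range(20)]
choice, scores = select_training_data(recent, old_drifted)
print(choice, scores)
```

In this toy setup the selection depends on whether the old data agrees with the current concept, which is the point the abstract makes about blind reuse. The sketch retrains a model for every candidate and fold; the abstract's concern is precisely that such comparisons are costly unless the learning algorithm makes them cheap.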


