ABSTRACT
One major problem with existing methods for mining data streams is that they make ad hoc choices about how to combine the most recent data with some amount of old data when searching for a new hypothesis. The assumption is that the additional old data always helps produce a more accurate hypothesis than using the most recent data alone. We first criticize this notion and point out that using old data blindly is no better than "gambling"; in other words, it increases accuracy only if we are "lucky." We discuss and analyze the situations in which old data helps and what kind of old data helps. The practical difficulty in choosing the right examples from old data lies in the formidable cost of comparing different possibilities and models. This problem goes away if we have an algorithm efficient enough to compare all sensible choices at little extra cost. Based on this observation, we propose a simple, efficient, and accurate cross-validation decision tree ensemble method.
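The core idea of comparing candidate training sets by cross-validation, rather than assuming old data helps, can be illustrated with a small sketch. This is not the authors' ensemble method: the one-dimensional threshold classifier (`fit_stump`), the two candidate sets, the fold scheme, and all function names are assumptions chosen only to keep the example short and self-contained.

```python
def fit_stump(data):
    """Fit a 1-D threshold classifier on (x, label) pairs, label in {0, 1}.

    Hypothetical stand-in for a decision tree: tries every observed x as a
    threshold, in both polarities, and keeps the one with fewest errors.
    """
    best = None
    for t in sorted({x for x, _ in data}):
        for polarity in (0, 1):
            errors = sum(1 for x, y in data if (int(x > t) ^ polarity) != y)
            if best is None or errors < best[0]:
                best = (errors, t, polarity)
    _, t, polarity = best
    return lambda x: int(x > t) ^ polarity


def cv_accuracy(recent, extra, k=5):
    """k-fold cross-validation on the recent data only.

    Each fold of `recent` is held out in turn; the model is trained on the
    remaining recent examples plus all of `extra` (possibly empty).
    """
    folds = [recent[i::k] for i in range(k)]
    correct = 0
    for i, held_out in enumerate(folds):
        train = [ex for j, f in enumerate(folds) if j != i for ex in f] + extra
        model = fit_stump(train)
        correct += sum(model(x) == y for x, y in held_out)
    return correct / len(recent)


def choose_training_set(recent, old, k=5):
    """Score each candidate on held-out recent data and return the best."""
    candidates = {"recent_only": [], "recent_plus_old": old}
    scores = {name: cv_accuracy(recent, extra, k)
              for name, extra in candidates.items()}
    return max(scores, key=scores.get), scores


# Recent data: label is 1 when x is above 0.5.
recent = [(i / 20, int(i > 10)) for i in range(20)]

# Case 1: the concept has drifted, so old labels follow the opposite rule.
old_drifted = [(i / 100, int(i <= 50)) for i in range(100)]
choice_drift, scores_drift = choose_training_set(recent, old_drifted)

# Case 2: the concept is stable, so old labels follow the same rule.
old_stable = [(i / 100, int(i > 50)) for i in range(100)]
choice_stable, scores_stable = choose_training_set(recent, old_stable)

print(choice_drift, scores_drift)
print(choice_stable, scores_stable)
```

In the drifted case the sketch selects the recent data alone, and in the stable case it selects the combined set, which mirrors the point that old data helps only under the right conditions. The cost of this naive comparison grows with every candidate considered, which is the expense the abstract says an efficient algorithm must avoid.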