Categorizing and mining concept drifting data streams
Source
International Conference on Knowledge Discovery and Data Mining
Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Las Vegas, Nevada, USA
SESSION: Research papers
Pages 812-820
Year of Publication: 2008
ISBN: 978-1-60558-193-4
Authors
Peng Zhang, Chinese Academy of Sciences, Beijing, China
Xingquan Zhu, Florida Atlantic University, Boca Raton, FL, USA
Yong Shi, University of Nebraska at Omaha, Nebraska, NE, USA
Bibliometrics
Downloads (6 Weeks): 42, Downloads (12 Months): 445, Citation Count: 1
ABSTRACT
Mining concept drifting data streams is a defining challenge for data mining research. Recent years have seen a large body of work on detecting changes and building prediction models from stream data, but with only a vague understanding of the types of concept drifting and of how each type affects mining algorithms. In this paper, we first categorize concept drifting into two scenarios, Loose Concept Drifting (LCD) and Rigorous Concept Drifting (RCD), and then propose a solution for each. For LCD data streams, because concepts in adjacent data chunks are sufficiently close to each other, we apply the kernel mean matching (KMM) method to minimize the discrepancy between the data chunks in the kernel space. This minimization produces weighted instances, which are used to build a classifier ensemble that handles concept drifting data streams. For RCD data streams, because genuine concepts in adjacent data chunks may change randomly and rapidly, we propose a new Optimal Weights Adjustment (OWA) method that determines the optimum weight values for classifiers trained from the most recent (up-to-date) data chunk, so that those classifiers form an accurate classifier ensemble for predicting instances in the yet-to-come data chunk. Experiments on synthetic and real-world datasets show that the weighted instance approach is preferable when concept drifting is mainly caused by changes in the class prior probability, whereas the weighted classifier approach is preferable when concept drifting is mainly triggered by changes in the conditional probability.
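The instance-weighting idea behind kernel mean matching can be illustrated with a small sketch. It is not the authors' implementation: it assumes a Gaussian (RBF) kernel, replaces the usual quadratic-programming solver with plain projected gradient descent, and omits the constraint that keeps the sum of the weights near the chunk size. The names `rbf_kernel` and `kmm_weights` and the parameters `gamma`, `upper`, and `steps` are hypothetical and chosen only for this example.

```python
import math


def rbf_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel between two equal-length feature vectors."""
    return math.exp(-gamma * sum((u - v) ** 2 for u, v in zip(a, b)))


def kmm_weights(old_chunk, new_chunk, gamma=1.0, upper=10.0, steps=500):
    """Simplified kernel mean matching (assumption: box constraints only).

    Minimizes 0.5 * b^T K b - kappa^T b over weights b in [0, upper],
    which pulls the weighted kernel mean of old_chunk toward the
    kernel mean of new_chunk.
    """
    n, m = len(old_chunk), len(new_chunk)
    # K[i][j]: similarity between instances of the old chunk.
    K = [[rbf_kernel(xi, xj, gamma) for xj in old_chunk] for xi in old_chunk]
    # kappa[i]: scaled similarity of old instance i to the whole new chunk.
    kappa = [
        (n / m) * sum(rbf_kernel(xi, xj, gamma) for xj in new_chunk)
        for xi in old_chunk
    ]
    beta = [1.0] * n
    # The RBF kernel has ones on its diagonal, so trace(K) = n bounds the
    # largest eigenvalue of K and 1/n is a safe gradient step size.
    step = 1.0 / n
    for _ in range(steps):
        grad = [
            sum(K[i][j] * beta[j] for j in range(n)) - kappa[i]
            for i in range(n)
        ]
        # Gradient step followed by projection onto the box [0, upper].
        beta = [min(upper, max(0.0, b - step * g)) for b, g in zip(beta, grad)]
    return beta


# Toy one-dimensional chunks: the last old instance lies far from the new chunk.
old_chunk = [[0.0], [0.1], [0.2], [3.0]]
new_chunk = [[0.0], [0.1], [0.2], [0.15]]
weights = kmm_weights(old_chunk, new_chunk)
```

In this toy run the old instance at 3.0, which has no close neighbor in the new chunk, receives a weight close to zero, while the three instances near the new chunk share most of the total weight. Weights of this kind could then be passed to any learner that accepts per-instance weights when training the ensemble members.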
CITED BY
Albert Bifet, Geoff Holmes, Bernhard Pfahringer, Richard Kirkby, Ricard Gavaldà, New ensemble methods for evolving data streams, Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, June 28-July 01, 2009, Paris, France