|
ABSTRACT
Current research on data stream classification mainly focuses on supervised learning, in which a fully labeled data stream is needed for training. However, fully labeled data streams are expensive to obtain, which make the supervised learning approach difficult to be applied to real-life applications. In this paper, we model applications, such as credit fraud detection and intrusion detection, as a one-class data stream classification problem. The cost of fully labeling the data stream is reduced as users only need to provide some positive samples together with the unlabeled samples to the learner. Based on VFDT and POSC4.5, we propose our OcVFDT (One-class Very Fast Decision Tree) algorithm. Experimental study on both synthetic and real-life datasets shows that the OcVFDT has excellent classification performance. Even 80% of the samples in data stream are unlabeled, the classification performance of OcVFDT is still very close to that of VFDT, which is trained on fully labeled stream.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
B. Calvo, P. Larranaga, and J. A. Lozano. Learning Bayesian classifiers from positive and unlabeled examples. Pattern Recognition Letters, 28:2375--2384, 2007.
|
| |
2
|
F. Denis, R. Gilleron, and F. Letouzey. Learning from positive and unlabeled examples. Theoretical Computer Science, pages 70--83, 2005.
|
| |
3
|
T. Dietterich. Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation, 10(7):1895--1923, 1998.
|
| |
4
|
P. Domingos and G. Hulten. Mining high-speed data streams. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. (SIGKDD'00), pages 71--80. ACM New York, NY, USA, 2000.
|
| |
5
|
C. Elkan and K. Noto. Learning classifiers from only positive and unlabeled data. In Proceedings of the Fourteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. (SIGKDD'08).
|
| |
6
|
U. M. Fayyad and K. B. Irani. On the handling of continuous-valued attributes in decision tree generation. Machine Learning, 8:87--102, 1992.
|
| |
7
|
G. Fung, J. Yu, H. Lu, and P. Yu. Text classification without negative examples revisit. IEEE Transactions on Knowledge and Data Engineering, 18(1):6--20, 2006.
|
| |
8
|
J. Gama, P. Medas, and P. Rodrigues. Learning Decision Trees from Dynamic Data Streams. Journal of Universal Computer Science, 11(8):1353--1366, 2005.
|
| |
9
|
J. Gama, R. Rocha, and P. Medas. Accurate decision trees for mining high-speed data streams. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. (SIGKDD'03), pages 523--528. ACM Press New York, NY, USA, 2003.
|
| |
10
|
G. Hulten, P. Domingos, and L. Spencer. Mining massive data streams. In The Journal of Machine Learning Research, 2005.
|
| |
11
|
G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. (SIGKDD'01), pages 97--106. ACM New York, NY, USA, 2001.
|
| |
12
|
R. Jin and G. Agrawal. Efficient decision tree construction on streaming data. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. (SIGKDD'03), pages 571--576. ACM New York, NY, USA, 2003.
|
| |
13
|
W. Lee and B. Liu. Learning with Positive and Unlabeled Examples Using Weighted Logistic Regression. In Proceedings of Twentieth International Conference on Machine Learning. (ICML'03), volume 20, page 448, 2003.
|
| |
14
|
D. Lewis, Y. Yang, T. Rose, and F. Li. RCV1: A New Benchmark Collection for Text Categorization Research. The Journal of Machine Learning Research, 5:361--397, 2004.
|
| |
15
|
B. Liu, Y. Dai, X. Li, W. Lee, and P. Yu. Building text classifiers using positive and unlabeled examples. In Proceedings of the Third IEEE International Conference on Data Mining. (ICDM'03), pages 179--186, 2003.
|
| |
16
|
J. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
|
| |
17
|
B. Schölkopf, J. Platt, J. Shawe-Taylor, A. Smola, and R. Williamson. Estimating the Support of a High-Dimensional Distribution. Neural Computation, 13(7):1443--1471, 2001.
|
| |
18
|
W. Street and Y. Kim. A streaming ensemble algorithm (SEA) for large-scale classification. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. (SIGKDD'01), pages 377--382. ACM New York, NY, USA, 2001.
|
| |
19
|
H. Wang, W. Fan, P. Yu, and J. Han. Mining concept-drifting data streams using ensemble classifiers. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. (SIGKDD'03), pages 226--235. ACM New York, NY, USA, 2003.
|
| |
20
|
H. Yu. Single-Class Classification with Mapping Convergence. Machine Learning, 61(1):49--69, 2005.
|
| |
21
|
H. Yu, J. Han, and K. Chang. PEBL: web page classification without negative examples. IEEE Transactions on Knowledge and Data Engineering, 16(1):70--81, 2004.
|
| |
22
|
Y. Zhang and X. Jin. An automatic construction and organization strategy for ensemble learning on data streams. ACM SIGMOD Record, 35(3):28--33, 2006.
|
| |
23
|
Y. Zhang, X. Li, and M. Orlowska. One-Class Classification of Text Streams with Concept Drift. In Proceedings of the Third IEEE International Conference on Data Mining Workshops. (ICDMW'08), pages 116--125, 2008.
|
| |
24
|
X. Zhu, X. Wu, and Y. Yang. Dynamic Classifier Selection for Effective Mining from Noisy Data Streams. In Proceedings of the Fourth IEEE International Conference on Data Mining. (ICDM'04), pages 305--312, 2004.
|
|