ABSTRACT
In data stream clustering, it is desirable to have algorithms that can detect clusters of arbitrary shape, clusters that evolve over time, and clusters in the presence of noise. Existing data stream clustering algorithms are generally based on an online-offline approach: the online component captures synopsis information from the data stream (thus overcoming real-time and memory constraints), and the offline component generates clusters from the stored synopsis. Three aspects of the online-offline approach affect the overall performance of data stream clustering: the ease of deriving the synopsis from streaming data; the complexity of the data structure for storing and managing the synopsis; and the frequency at which the offline component is used to generate clusters. In this article, we propose an algorithm that (1) computes and updates synopsis information in constant time; (2) allows users to discover clusters at multiple resolutions; (3) determines the right time for users to generate clusters from the synopsis information; (4) generates clusters of higher purity than existing algorithms; and (5) determines the right threshold function for density-based clustering based on the fading model of stream data. To the best of our knowledge, no existing data stream clustering algorithm has all of these features. Experimental results show that our algorithm detects arbitrarily shaped, evolving clusters with high quality.