|
ABSTRACT
Traditional duplicate elimination techniques are not applicable to many data stream applications. In general, precisely eliminating duplicates in an unbounded data stream is not feasible in many streaming scenarios. Therefore, we target at approximately eliminating duplicates in streaming environments given a limited space. Based on a well-known bitmap sketch, we introduce a data structure, Stable Bloom Filter, and a novel and simple algorithm. The basic idea is as follows: since there is no way to store the whole history of the stream, SBF continuously evicts the stale information so that SBF has room for those more recent elements. After finding some properties of SBF analytically, we show that a tight upper bound of false positive rates is guaranteed. In our empirical study, we compare SBF to alternative methods. The results show that our method is superior in terms of both accuracy and time effciency when a fixed small space and an acceptable false positive rate are given.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
[1] R. Ananthakrishna, S. Chaudhuri, and V. Ganti. Eliminating fuzzy duplicates in data warehouses. In Proc. of VLDB, 2002.
|
| |
2
|
[2] Internet Archive. http://www.archive.org/.
|
 |
3
|
Brian Babcock , Shivnath Babu , Mayur Datar , Rajeev Motwani , Jennifer Widom, Models and issues in data stream systems, Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 03-05, 2002, Madison, Wisconsin
[doi> 10.1145/543613.543615]
|
| |
4
|
|
| |
5
|
[5] F. Baboescu, S. Singh, and G. Varghese. Packet classification for core routers: Is there an alternative to cams? In Proc. of INFOCOMM, 2003.
|
 |
6
|
|
 |
7
|
|
 |
8
|
|
| |
9
|
[9] D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, and S. B. Zdonik. Monitoring streams - a new class of data management applications. In Proc. of VLDB, 2002.
|
| |
10
|
[10] S. Chandrasekaran, O. Cooper, A. Deshpande, M. J. Franklin, J. M. Hellerstein, W. Hong, S. Krishnamurthy, S. Madden, V. Raman, F. Reiss, and M. A. Shah. Telegraphcq: Continuous dataflow processing for an uncertain world. In Proc. of CIDR, 2003.
|
| |
11
|
|
 |
12
|
Jianjun Chen , David J. DeWitt , Feng Tian , Yuan Wang, NiagaraCQ: a scalable continuous query system for Internet databases, Proceedings of the 2000 ACM SIGMOD international conference on Management of data, p.379-390, May 15-18, 2000, Dallas, Texas, United States
|
 |
13
|
|
 |
14
|
|
| |
15
|
|
 |
16
|
|
| |
17
|
[17] C. Estan and G. Varghese. Data streaming in computer networks. In Proc. of Workshop on Management and Processing of Data Streams (MPDS) in cooperation with SIGMOD/PODS, 2003.
|
| |
18
|
|
| |
19
|
|
| |
20
|
|
| |
21
|
|
| |
22
|
[22] Cisco System Inc. Cisco network accounting services. http://www.cisco.com/warp/public/cc/pd/iosw/ prodlit/nwact_wp.pdf, 2002.
|
| |
23
|
[23] G. S. Manku and R. Motwani. Approximate frequency counts over data streams. In Proc. of VLDB, 2002.
|
 |
24
|
|
| |
25
|
|
| |
26
|
|
| |
27
|
[27] M. O. Rabin. Fingerprinting by random polynomials. Technical Report TR-15-81, Center for Research in Computing Technology, Harvard University, 1981.
|
| |
28
|
[28] N. Tatbul, U. Cetintemel, S. Zdonik, M. Cherniack, and M. Stonebraker. Load shedding in a data stream manager. In Proc. of VLDB, 2003.
|
| |
29
|
[29] P. A. Tucker, D. Maier, and T. Sheard. Applying punctuation schemes to queries over continuous data streams. IEEE Data Eng. Bull., 26(1):33-40, 2003.
|
 |
30
|
|
 |
31
|
|
|