ACM Home Page
Please provide us with feedback. Feedback
Approximately detecting duplicates for streaming data using stable bloom filters
Full text PdfPdf (304 KB)
Source International Conference on Management of Data archive
Proceedings of the 2006 ACM SIGMOD international conference on Management of data table of contents
Chicago, IL, USA
SESSION: P2P systems & data streams table of contents
Pages: 25 - 36  
Year of Publication: 2006
ISBN:1-59593-434-0
Authors
Fan Deng  University of Alberta, Edmonton, Alberta, Canada
Davood Rafiei  University of Alberta, Edmonton, Alberta, Canada
Sponsors
ACM: Association for Computing Machinery
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 6,   Downloads (12 Months): 123,   Citation Count: 3
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1142473.1142477
What is a DOI?

ABSTRACT

Traditional duplicate elimination techniques are not applicable to many data stream applications. In general, precisely eliminating duplicates in an unbounded data stream is not feasible in many streaming scenarios. Therefore, we target at approximately eliminating duplicates in streaming environments given a limited space. Based on a well-known bitmap sketch, we introduce a data structure, Stable Bloom Filter, and a novel and simple algorithm. The basic idea is as follows: since there is no way to store the whole history of the stream, SBF continuously evicts the stale information so that SBF has room for those more recent elements. After finding some properties of SBF analytically, we show that a tight upper bound of false positive rates is guaranteed. In our empirical study, we compare SBF to alternative methods. The results show that our method is superior in terms of both accuracy and time effciency when a fixed small space and an acceptable false positive rate are given.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
[1] R. Ananthakrishna, S. Chaudhuri, and V. Ganti. Eliminating fuzzy duplicates in data warehouses. In Proc. of VLDB, 2002.
 
2
[2] Internet Archive. http://www.archive.org/.
3
 
4
 
5
[5] F. Baboescu, S. Singh, and G. Varghese. Packet classification for core routers: Is there an alternative to cams? In Proc. of INFOCOMM, 2003.
6
7
8
 
9
[9] D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, and S. B. Zdonik. Monitoring streams - a new class of data management applications. In Proc. of VLDB, 2002.
 
10
[10] S. Chandrasekaran, O. Cooper, A. Deshpande, M. J. Franklin, J. M. Hellerstein, W. Hong, S. Krishnamurthy, S. Madden, V. Raman, F. Reiss, and M. A. Shah. Telegraphcq: Continuous dataflow processing for an uncertain world. In Proc. of CIDR, 2003.
 
11
12
13
14
 
15
16
 
17
[17] C. Estan and G. Varghese. Data streaming in computer networks. In Proc. of Workshop on Management and Processing of Data Streams (MPDS) in cooperation with SIGMOD/PODS, 2003.
 
18
 
19
 
20
 
21
 
22
[22] Cisco System Inc. Cisco network accounting services. http://www.cisco.com/warp/public/cc/pd/iosw/ prodlit/nwact_wp.pdf, 2002.
 
23
[23] G. S. Manku and R. Motwani. Approximate frequency counts over data streams. In Proc. of VLDB, 2002.
24
 
25
 
26
 
27
[27] M. O. Rabin. Fingerprinting by random polynomials. Technical Report TR-15-81, Center for Research in Computing Technology, Harvard University, 1981.
 
28
[28] N. Tatbul, U. Cetintemel, S. Zdonik, M. Cherniack, and M. Stonebraker. Load shedding in a data stream manager. In Proc. of VLDB, 2003.
 
29
[29] P. A. Tucker, D. Maier, and T. Sheard. Applying punctuation schemes to queries over continuous data streams. IEEE Data Eng. Bull., 26(1):33-40, 2003.
30
31