ABSTRACT
A Bloom Filter is a space-efficient randomized data structure allowing membership queries over sets with certain allowable errors. It is widely used in applications that take advantage of its ability to represent a set compactly and to effectively filter out any element that does not belong to the set, with small error probability. This paper introduces the Spectral Bloom Filter (SBF), an extension of the original Bloom Filter to multi-sets, allowing the filtering of elements whose multiplicities are below a threshold given at query time. Using memory only slightly larger than that of the original Bloom Filter, the SBF supports queries on the multiplicities of individual keys with a guaranteed, small error probability. The SBF also supports insertions and deletions over the data set. We present novel methods for reducing the probability and magnitude of errors. We also present an efficient data structure and algorithms to build it incrementally and maintain it over streaming data, as well as over materialized data with arbitrary insertions and deletions. The SBF does not assume any a priori filtering threshold and effectively and efficiently maintains information over the entire data set, allowing for ad-hoc queries with arbitrary parameters and enabling a range of new applications.
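To make the idea of multiplicity queries concrete, the following is a minimal sketch rather than the paper's data structure. It assumes the simplest counter-based reading of the abstract: the bit vector of a Bloom Filter is replaced by an array of counters, and a key's multiplicity is estimated as the minimum of its counters. The class name `SpectralBloomFilter`, the parameters `m` and `k`, the SHA-256 double-hashing scheme, and the method names are all illustrative assumptions. The compact storage and the error-reduction methods mentioned in the abstract are not modeled.

```python
import hashlib


class SpectralBloomFilter:
    """Illustrative counter-based filter (hypothetical names throughout)."""

    def __init__(self, m, k):
        self.m = m                  # number of counters (assumed parameter)
        self.k = k                  # number of hash positions per key (assumed)
        self.counters = [0] * m

    def _positions(self, key):
        # Derive up to k distinct positions from one SHA-256 digest via
        # double hashing; this hashing scheme is an assumption of the sketch.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return {(h1 + i * h2) % self.m for i in range(self.k)}

    def insert(self, key, count=1):
        # Add `count` occurrences of `key` to every counter it maps to.
        for p in self._positions(key):
            self.counters[p] += count

    def estimate(self, key):
        # Minimum over the key's counters: never below the true multiplicity,
        # and above it only when every counter is shared with other keys.
        return min(self.counters[p] for p in self._positions(key))

    def delete(self, key, count=1):
        # Remove `count` occurrences; refuse if the estimate rules them out.
        if self.estimate(key) < count:
            raise ValueError("cannot delete more occurrences than estimated")
        for p in self._positions(key):
            self.counters[p] -= count

    def at_least(self, key, threshold):
        # Threshold filtering, with the threshold supplied at query time.
        return self.estimate(key) >= threshold


sbf = SpectralBloomFilter(m=1024, k=4)
for item in ["a", "a", "a", "b"]:
    sbf.insert(item)
sbf.delete("a")
```

In this sketch an estimate can be too high when all of a key's counters collide with other keys, but it is never too low, provided only keys that were actually inserted are deleted. That is why `at_least` can safely discard keys that fall below the threshold.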