|
ABSTRACT
There is growing interest in algorithms for processing and querying continuous data streams (i.e., data that is seen only once in a fixed order) with limited memory resources. In its most general form, a data stream is actually an update stream, i.e., comprising data-item deletions as well as insertions. Such massive update streams arise naturally in several application domains (e.g., monitoring of large IP network installations, or processing of retail-chain transactions).Estimating the cardinality of set expressions defined over several (perhaps, distributed) update streams is perhaps one of the most fundamental query classes of interest; as an example, such a query may ask "what is the number of distinct IP source addresses seen in passing packets from both router R1 and R2 but not router R3?". Earlier work has only addressed very restricted forms of this problem, focusing solely on the special case of insert-only streams and specific operators (e.g., union). In this paper, we propose the first space-efficient algorithmic solution for estimating the cardinality of full-fledged set expressions over general update streams. Our estimation algorithms are probabilistic in nature and rely on a novel, hash-based synopsis data structure, termed "2-level hash sketch". We demonstrate how our 2-level hash sketch synopses can be used to provide low-error, high-confidence estimates for the cardinality of set expressions (including operators such as set union, intersection, and difference) over continuous update streams, using only small space and small processing time per update. Furthermore, our estimators never require rescanning or resampling of past stream items, regardless of the number of deletions in the stream. We also present lower bounds for the problem, demonstrating that the space usage of our estimation algorithms is within small factors of the optimal. Preliminary experimental results verify the effectiveness of our approach.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
Noga Alon , Phillip B. Gibbons , Yossi Matias , Mario Szegedy, Tracking join and self-join sizes in limited storage, Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.10-20, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
[doi> 10.1145/303976.303978]
|
 |
2
|
Noga Alon , Yossi Matias , Mario Szegedy, The space complexity of approximating the frequency moments, Proceedings of the twenty-eighth annual ACM symposium on Theory of computing, p.20-29, May 22-24, 1996, Philadelphia, Pennsylvania, United States
[doi> 10.1145/237814.237823]
|
| |
3
|
N. Alon and J. H. Spencer. "The Probabilistic Method". John Wiley & Sons, Inc., 1992.
|
| |
4
|
|
 |
5
|
Andrei Z. Broder , Moses Charikar , Alan M. Frieze , Michael Mitzenmacher, Min-wise independent permutations (extended abstract), Proceedings of the thirtieth annual ACM symposium on Theory of computing, p.327-336, May 24-26, 1998, Dallas, Texas, United States
[doi> 10.1145/276698.276781]
|
 |
6
|
Moses Charikar , Surajit Chaudhuri , Rajeev Motwani , Vivek Narasayya, Towards estimation error guarantees for distinct values, Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.268-279, May 15-18, 2000, Dallas, Texas, United States
[doi> 10.1145/335168.335230]
|
 |
7
|
Zhiyuan Chen , Nick Koudas , Flip Korn , S. Muthukrishnan, Selectively estimation for Boolean queries, Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.216-225, May 15-18, 2000, Dallas, Texas, United States
[doi> 10.1145/335168.335225]
|
| |
8
|
|
| |
9
|
|
 |
10
|
|
| |
11
|
|
| |
12
|
|
| |
13
|
M. Garofalakis, J. Gehrke, and R. Rastogi. "Querying and Mining Data Streams: You Only Get One Look". Tutorial in 28th Intl. Conf. on Very Large Data Bases, August 2002.
|
| |
14
|
|
 |
15
|
|
| |
16
|
A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. J. Strauss. "How to Summarize the Universe: Dynamic Maintenance of Quantiles". In Proc. of the 28th Intl. Conf. on Very Large Data Bases, August 2002.
|
| |
17
|
|
| |
18
|
|
| |
19
|
|
| |
20
|
|
| |
21
|
|
| |
22
|
|
CITED BY 24
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Matthias Bender , Sebastian Michel , Peter Triantafillou , Gerhard Weikum , Christian Zimmer, Improving collection selection with overlap awareness in P2P search engines, Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, August 15-19, 2005, Salvador, Brazil
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Abhinandan Das , Sumit Ganguly , Minos Garofalakis , Rajeev Rastogi, Distributed set-expression cardinality estimation, Proceedings of the Thirtieth international conference on Very large data bases, p.312-323, August 31-September 03, 2004, Toronto, Canada
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Lu-An Tang , Bin Gui , Hong-Yan Li , Gao-Shan Miao , Dong-Qing Yang , Xin-Biao Zhou, PGG: an online pattern based approach for stream variation management, Journal of Computer Science and Technology, v.23 n.4, p.497-515, July 2008
|
|
|
|
|