ACM Home Page
Please provide us with feedback. Feedback
Processing set expressions over continuous update streams
Full text PdfPdf (283 KB)
Source International Conference on Management of Data archive
Proceedings of the 2003 ACM SIGMOD international conference on Management of data table of contents
San Diego, California
SESSION: Stream query processing II table of contents
Pages: 265 - 276  
Year of Publication: 2003
ISBN:1-58113-634-X
Authors
Sumit Ganguly  Bell Laboratories, Lucent Technologies, Murray Hill, NJ
Minos Garofalakis  Bell Laboratories, Lucent Technologies, Murray Hill, NJ
Rajeev Rastogi  Bell Laboratories, Lucent Technologies, Murray Hill, NJ
Sponsor
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 5,   Downloads (12 Months): 41,   Citation Count: 24
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/872757.872790
What is a DOI?

ABSTRACT

There is growing interest in algorithms for processing and querying continuous data streams (i.e., data that is seen only once in a fixed order) with limited memory resources. In its most general form, a data stream is actually an update stream, i.e., comprising data-item deletions as well as insertions. Such massive update streams arise naturally in several application domains (e.g., monitoring of large IP network installations, or processing of retail-chain transactions).Estimating the cardinality of set expressions defined over several (perhaps, distributed) update streams is perhaps one of the most fundamental query classes of interest; as an example, such a query may ask "what is the number of distinct IP source addresses seen in passing packets from both router R1 and R2 but not router R3?". Earlier work has only addressed very restricted forms of this problem, focusing solely on the special case of insert-only streams and specific operators (e.g., union). In this paper, we propose the first space-efficient algorithmic solution for estimating the cardinality of full-fledged set expressions over general update streams. Our estimation algorithms are probabilistic in nature and rely on a novel, hash-based synopsis data structure, termed "2-level hash sketch". We demonstrate how our 2-level hash sketch synopses can be used to provide low-error, high-confidence estimates for the cardinality of set expressions (including operators such as set union, intersection, and difference) over continuous update streams, using only small space and small processing time per update. Furthermore, our estimators never require rescanning or resampling of past stream items, regardless of the number of deletions in the stream. We also present lower bounds for the problem, demonstrating that the space usage of our estimation algorithms is within small factors of the optimal. Preliminary experimental results verify the effectiveness of our approach.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
2
 
3
N. Alon and J. H. Spencer. "The Probabilistic Method". John Wiley & Sons, Inc., 1992.
 
4
5
6
7
 
8
 
9
10
 
11
 
12
 
13
M. Garofalakis, J. Gehrke, and R. Rastogi. "Querying and Mining Data Streams: You Only Get One Look". Tutorial in 28th Intl. Conf. on Very Large Data Bases, August 2002.
 
14
15
 
16
A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. J. Strauss. "How to Summarize the Universe: Dynamic Maintenance of Quantiles". In Proc. of the 28th Intl. Conf. on Very Large Data Bases, August 2002.
 
17
 
18
 
19
 
20
 
21
 
22

CITED BY  24

Collaborative Colleagues:
Sumit Ganguly: colleagues
Minos Garofalakis: colleagues
Rajeev Rastogi: colleagues