ACM Home Page
Please provide us with feedback. Feedback
Sampling time-based sliding windows in bounded space
Full text PdfPdf (525 KB)
Source
International Conference on Management of Data archive
Proceedings of the 2008 ACM SIGMOD international conference on Management of data table of contents
Vancouver, Canada
SESSION: Research Session 9: Strings and Time table of contents
Pages 379-392  
Year of Publication: 2008
ISBN:978-1-60558-102-6
Authors
Rainer Gemulla  Technische Universität Dresden, Dresden, Germany
Wolfgang Lehner  Technische Universität Dresden, Dresden, Germany
Sponsors
ACM: Association for Computing Machinery
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 15,   Downloads (12 Months): 174,   Citation Count: 1
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1376616.1376657
What is a DOI?

ABSTRACT

Random sampling is an appealing approach to build synopses of large data streams because random samples can be used for a broad spectrum of analytical tasks. Users are often interested in analyzing only the most recent fraction of the data stream in order to avoid outdated results. In this paper, we focus on sampling schemes that sample from a sliding window over a recent time interval; such windows are a popular and highly comprehensible method to model recency. In this setting, the main challenge is to guarantee an upper bound on the space consumption of the sample while using the allotted space efficiently at the same time. The difficulty arises from the fact that the number of items in the window is unknown in advance and may vary significantly over time, so that the sampling fraction has to be adjusted dynamically. We consider uniform sampling schemes, which produce each sample of the same size with equal probability, and stratified sampling schemes, in which the window is divided into smaller strata and a uniform sample is maintained per stratum. For uniform sampling, we prove that it is impossible to guarantee a minimum sample size in bounded space. We then introduce a novel sampling scheme called bounded priority sampling (BPS), which requires only bounded space. We derive a lower bound on the expected sample size and show that BPS quickly adapts to changing data rates. For stratified sampling, we propose a merge-based stratification scheme (MBS), which maintains strata of approximately equal size. Compared to naive stratification, MBS has the advantage that the sample is evenly distributed across the window, so that no part of the window is over- or underrepresented. We conclude the paper with a feasibility study of our algorithms on large real-world datasets.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
 
2
3
 
4
5
 
6
 
7
 
8
 
9
Peter J. Haas. Data stream sampling: Basic techniques and results. In Data Stream Management: Processing High Speed Data Streams. Springer, 2006.
 
10
11
 
12
Carl-Erik Särndal, Bengt Swensson, and Jan Wretman. Model Assisted Survey Sampling. Springer Series in Statistics. Springer, 1991.
 
13
Raimund Seidel and Cecilia R. Aragon. Randomized search trees. Algorithmica, 16(4/5):464--497, 1996.
14
15


Collaborative Colleagues:
Rainer Gemulla: colleagues
Wolfgang Lehner: colleagues