|
ABSTRACT
A sliding windows model is an important case of the streaming model, where only the most "recent" elements remain active and the rest are discarded in a stream. The sliding windows model is important for many applications (see, e.g., Babcock, Babu, Datar, Motwani and Widom (PODS 02); and Datar, Gionis, Indyk and Motwani (SODA 02)). There are two equally important types of the sliding windows model -- windows with fixed size, (e.g., where items arrive one at a time, and only the most recent n items remain active for some fixed parameter n), and bursty windows (e.g., where many items can arrive in "bursts" at a single step and where only items from the last t steps remain active, again for some fixed parameter t). Random sampling is a fundamental tool for data streams, as numerous algorithms operate on the sampled data instead of on the entire stream. Effective sampling from sliding windows is a nontrivial problem, as elements eventually expire. In fact, the deletions are implicit; i.e., it is not possible to identify deleted elements without storing the entire window. The implicit nature of deletions on sliding windows does not allow the existing methods (even those that support explicit deletions, e.g., Cormode, Muthukrishnan and Rozenbaum (VLDB 05); Frahling, Indyk and Sohler (SOCG 05)) to be directly "translated" to the sliding windows model. One trivial approach to overcoming the problem of implicit deletions is that of over-sampling. When k samples are required, the over-sampling method maintains k'>k samples in the hope that at least k samples are not expired. The obvious disadvantages of this method are twofold: (a) It introduces additional costs and thus decreases the performance; and (b) The memory bounds are not deterministic, which is atypical for streaming algorithms (where even small probability events may eventually happen for a stream that is big enough). Babcock, Datar and Motwani (SODA 02), were the first to stress the importance of improvements to over-sampling. They formally introduced the problem of sampling from sliding windows and improved the over-sampling method for sampling with replacement. Their elegant solutions for sampling with replacement are optimal in expectation, and thus resolve disadvantage (a) mentioned above. Unfortunately, the randomized bounds do not resolve disadvantage (b) above. Interestingly, all algorithms that employ the ideas of Babcock, Datar and Motwani have the same central problem of having to deal with randomized complexity (see e.g., Datar and Muthukrishnan (ESA 02); Chakrabarti, Cormode and McGregor (SODA 07)). Further, the proposed solutions of Babcock, Datar and Motwani for sampling without replacement are based on the criticized over-sampling method and thus do not solve problem (a). Therefore, the question of whether we can solve sampling on sliding windows optimally (i.e., resolving both disadvantages) is implicit in the paper of Babcock, Datar and Motwani and has remained open for all variants of the problem. In this paper we answer these questions affirmatively and provide optimal sampling schemas for all variants of the problem, i.e., sampling with or without replacement from fixed or bursty windows. Specifically, for fixed-size windows, we provide optimal solutions that require O(k) memory; for bursty windows, we show algorithms that require O(klogn), which is optimal since it matches the lower bound by Gemulla and Lehner (SIGMOD 08). In contrast to the work of of Babcock, Datar and Motwani, our solutions have deterministic bounds. Thus, we prove a perhaps somewhat surprising fact: the memory complexity of the sampling-based algorithm for all variants of the sliding windows model is comparable with that of streaming models (i.e., without the sliding windows). This is the first result of this type, since all previous "translations" of sampling-based algorithms to sliding windows incur randomized memory guarantees only.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
|
 |
3
|
|
 |
4
|
Noga Alon , Yossi Matias , Mario Szegedy, The space complexity of approximating the frequency moments, Proceedings of the twenty-eighth annual ACM symposium on Theory of computing, p.20-29, May 22-24, 1996, Philadelphia, Pennsylvania, United States
[doi> 10.1145/237814.237823]
|
| |
5
|
A. Arasu, B. Babcock, S. Babu, J. Cieslewicz, M. Datar, K. Ito, R. Motwani, U. Srivastava, J. Widom, "STREAM: The Stanford Data Stream Management System," Book Chapter, "Data-Stream Management: Processing High-Speed Data Streams", Springer-Verlag, 2005.
|
 |
6
|
|
 |
7
|
|
 |
8
|
Brian Babcock , Shivnath Babu , Mayur Datar , Rajeev Motwani , Jennifer Widom, Models and issues in data stream systems, Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 03-05, 2002, Madison, Wisconsin
[doi> 10.1145/543613.543615]
|
| |
9
|
|
| |
10
|
|
| |
11
|
|
 |
12
|
Brain Babcock , Mayur Datar , Rajeev Motwani , Liadan O'Callaghan, Maintaining variance and k-medians over data stream windows, Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.234-243, June 09-11, 2003, San Diego, California
[doi> 10.1145/773153.773176]
|
 |
13
|
|
| |
14
|
|
| |
15
|
Ziv Bar-Yossef , Ravi Kumar , D. Sivakumar, Reductions in streaming algorithms, with an application to counting triangles in graphs, Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms, p.623-632, January 06-08, 2002, San Francisco, California
|
| |
16
|
|
 |
17
|
|
 |
18
|
Lakshminath Bhuvanagiri , Sumit Ganguly , Deepanjan Kesh , Chandan Saha, Simpler algorithm for estimating frequency moments of data streams, Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, p.708-713, January 22-26, 2006, Miami, Florida
[doi> 10.1145/1109557.1109634]
|
| |
19
|
|
 |
20
|
Luciana S. Buriol , Gereon Frahling , Stefano Leonardi , Alberto Marchetti-Spaccamela , Christian Sohler, Counting triangles in data streams, Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 26-28, 2006, Chicago, IL, USA
[doi> 10.1145/1142351.1142388]
|
| |
21
|
|
| |
22
|
A. Chakrabarti, K. Do Ba, S. Muthukrishnan, "Estimating Entropy and Entropy Norm on Data Streams", In Proceedings of the 23rd International Symposium on Theoretical Aspects of Computer Science, 2006.
|
| |
23
|
A. Chakrabarti, S. Khot, X. Sun, "Near-optimal lower bounds on the multi-party communication complexity of set-disjointness", Proceedings of the 18th Annual IEEE Conference on Computational Complexity, 2003.
|
 |
24
|
|
| |
25
|
|
| |
26
|
K. Chaudhuri, N. Mishra, "When Random Sampling Preserves Privacy", CRYPTO, 2006.
|
 |
27
|
Surajit Chaudhuri , Rajeev Motwani , Vivek Narasayya, On random sampling over joins, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p.263-274, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
|
| |
28
|
|
| |
29
|
|
 |
30
|
|
| |
31
|
|
| |
32
|
|
| |
33
|
|
| |
34
|
|
| |
35
|
Anirban Dasgupta , Petros Drineas , Boulos Harb , Ravi Kumar , Michael W. Mahoney, Sampling algorithms and coresets for ℓp regression, Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms, p.932-941, January 20-22, 2008, San Francisco, California
|
| |
36
|
Mayur Datar , Aristides Gionis , Piotr Indyk , Rajeev Motwani, Maintaining stream statistics over sliding windows: (extended abstract), Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms, p.635-644, January 06-08, 2002, San Francisco, California
|
| |
37
|
|
 |
38
|
|
| |
39
|
J. Feigenbaum, S. Kannan, and J. Zhang, "Computing diameter in the streaming and sliding-window models", Algorithmica, 41:25--41, 2005.
|
| |
40
|
Joan Feigenbaum , Sampath Kannan , Andrew McGregor , Siddharth Suri , Jian Zhang, Graph distances in the streaming model: the value of space, Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms, January 23-25, 2005, Vancouver, British Columbia
|
| |
41
|
|
| |
42
|
J. Feigenbaum, S. Kannan, M. Strauss, M. Viswanathan, "Testing and Spot-Checking of Data Streams", Algorithmica, 34(1): 67--80, 2002.
|
 |
43
|
|
| |
44
|
S. Ganguly. "Estimating Frequency Moments of Update Streams using Random Linear Combinations". Proceedings of the 8th International Workshop on Randomized Algorithms, pp. 369--380, 2004.
|
| |
45
|
|
 |
46
|
|
| |
47
|
R. Gemulla, "Sampling Algorithms for Evolving Datasets", PhD Dissertation.
|
 |
48
|
|
 |
49
|
|
 |
50
|
|
 |
51
|
Lukasz Golab , David DeHaan , Erik D. Demaine , Alejandro Lopez-Ortiz , J. Ian Munro, Identifying frequent items in sliding windows over on-line packet streams, Proceedings of the 3rd ACM SIGCOMM conference on Internet measurement, October 27-29, 2003, Miami Beach, FL, USA
[doi> 10.1145/948205.948227]
|
| |
52
|
|
 |
53
|
|
| |
54
|
|
| |
55
|
P.J. Haas, "Data stream sampling: Basic techniques and results", In M. Garofalakis, J. Gehrke, and R. Rastogi (Eds.), Data Stream Management: Processing High Speed Data Streams, Springer.
|
| |
56
|
|
 |
57
|
|
| |
58
|
H. Jowhari, M. Ghodsi, "New streaming algorithms for counting triangles in graphs", Proceedings of the 11th COCOON, pp. 710--716, 2005.
|
 |
59
|
|
 |
60
|
|
 |
61
|
|
 |
62
|
|
 |
63
|
Jin Li , David Maier , Kristin Tufte , Vassilis Papadimos , Peter A. Tucker, Semantics and evaluation techniques for window aggregates in data streams, Proceedings of the 2005 ACM SIGMOD international conference on Management of data, June 14-16, 2005, Baltimore, Maryland
[doi> 10.1145/1066157.1066193]
|
 |
64
|
|
| |
65
|
|
| |
66
|
|
 |
67
|
|
| |
68
|
|
| |
69
|
C. Procopiuc, O. Procopiuc, "Density Estimation for Spatial Data Streams", Proceedings of the 9th International Symposium on Spatial and Temporal Databases, pp.109--126, 2005.
|
 |
70
|
|
| |
71
|
|
 |
72
|
|
| |
73
|
L. Zhang, Z. Li, M. Yu, Y. Wang, Y. Jiang, "Random sampling algorithms for sliding windows over data streams", Proc. of the 11th Joint International Computer Conference, pp. 572--575, 2005.
|
 |
74
|
Haiquan (Chuck) Zhao , Ashwin Lall , Mitsunori Ogihara , Oliver Spatscheck , Jia Wang , Jun Xu, A data streaming algorithm for estimating entropies of od flows, Proceedings of the 7th ACM SIGCOMM conference on Internet measurement, October 24-26, 2007, San Diego, California, USA
[doi> 10.1145/1298306.1298345]
|
|