|
ABSTRACT
From a high volume stream of weighted items, we want to maintain a generic sample of a certain limited size k that we can later use to estimate the total weight of arbitrary subsets. This is the classic context of on-line reservoir sampling, thinking of the generic sample as a reservoir. We present an efficient reservoir sampling scheme, VarOptk, that dominates all previous schemes in terms of estimation quality. VarOptk provides variance optimal unbiased estimation of subset sums. More precisely, if we have seen n items of the stream, then for any subset size m, our scheme based on k samples minimizes the average variance over all subsets of size m. In fact, the optimality is against any off-line scheme with k samples tailored for the concrete set of items seen. In addition to optimal average variance, our scheme provides tighter worst-case bounds on the variance of particular subsets than previously possible. It is efficient, handling each new item of the stream in O(log k) time, which is optimal even on the word RAM. Finally, it is particularly well suited for combination of samples from different streams in a distributed setting.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
R. J. Adler, R. E. Feldman, and M. S. Taqqu. A Practical Guide to Heavy Tails. Birkhauser, 1998.
|
| |
2
|
M. T. Chao. A general purpose unequal probability sampling plan. Biometrika, 69(3):653--656, 1982.
|
 |
3
|
Surajit Chaudhuri , Rajeev Motwani , Vivek Narasayya, On random sampling over joins, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p.263-274, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
|
 |
4
|
|
 |
5
|
|
| |
6
|
E. Cohen and H. Kaplan. Tighter estimation using bottom-k sketches. In Proceedings of the 34th VLDB Conference, 2008.
|
| |
7
|
|
 |
8
|
|
| |
9
|
N. G. Duffield, C. Lund, and M. Thorup. Learn more, sample less: control of volume and variance in network measurements. IEEE Transactions on Information Theory, 51(5):1756--1775, 2005.
|
 |
10
|
|
| |
11
|
|
| |
12
|
C. T. Fan, M. E. Muller, and I. Rezucha. Development of sampling plans by using sequential (item by item) selection techniques and digital computers. J. Amer. Stat. Assoc., 57:387--402, 1962.
|
| |
13
|
D. G. Horvitz and D. J. Thompson. A generalization of sampling without replacement from a finite universe. J. Amer. Stat. Assoc., 47(260):663--685, 1952.
|
 |
14
|
|
| |
15
|
|
| |
16
|
David Moore , Vern Paxson , Stefan Savage , Colleen Shannon , Stuart Staniford , Nicholas Weaver, Inside the Slammer Worm, IEEE Security and Privacy, v.1 n.4, p.33-39, July 2003
[doi> 10.1109/MSECP.2003.1219056]
|
| |
17
|
|
| |
18
|
The Netflix Prize. http://www.netflixprize.com/.
|
| |
19
|
|
| |
20
|
C-E. Särndal, B. Swensson, and J. Wretman. Model Assisted Survey Sampling. Springer, 1992.
|
| |
21
|
|
 |
22
|
|
| |
23
|
M. Szegedy and M. Thorup. On the variance of subset sum estimation. In Proc. 15th ESA, LNCS 4698, pages 75--86, 2007.
|
 |
24
|
|
| |
25
|
Y. Tillé. An elimination procedure for unequal probability sampling without replacement. Biometrika, 83(1):238--241, 1996.
|
| |
26
|
Y. Tillé. Sampling Algorithms. 2006.
|
 |
27
|
|
|