| Online maintenance of very large random samples |
| Full text |
Pdf
(157 KB)
|
| Source
|
International Conference on Management of Data
archive
Proceedings of the 2004 ACM SIGMOD international conference on Management of data
table of contents
Paris, France
SESSION: Research sessions: statistics
table of contents
Pages: 299 - 310
Year of Publication: 2004
ISBN:1-58113-859-8
|
|
Authors
|
|
| Sponsor |
|
| Publisher |
|
| Bibliometrics |
Downloads (6 Weeks): 4, Downloads (12 Months): 43, Citation Count: 8
|
|
|
ABSTRACT
Random sampling is one of the most fundamental data management tools available. However, most current research involving sampling considers the problem of how to use a sample, and not how to compute one. The implicit assumption is that a "sample" is a small data structure that is easily maintained as new data are encountered, even though simple statistical arguments demonstrate that very large samples of gigabytes or terabytes in size can be necessary to provide high accuracy. No existing work tackles the problem of maintaining very large, disk-based samples from a data management perspective, and no techniques now exist for maintaining very large samples in an online manner from streaming data. In this paper, we present online algorithms for maintaining on-disk samples that are gigabytes or terabytes in size. The algorithms are designed for streaming data, or for any environment where a large sample must be maintained online in a single pass through a data set. The algorithms meet the strict requirement that the sample always be a true, statistically random sample (without replacement) of all of the data processed thus far. Our algorithms are also suitable for biased or unequal probability sampling.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
Swarup Acharya , Phillip B. Gibbons , Viswanath Poosala, Congressional samples for approximate answering of group-by queries, Proceedings of the 2000 ACM SIGMOD international conference on Management of data, p.487-498, May 15-18, 2000, Dallas, Texas, United States
|
 |
2
|
Swarup Acharya , Phillip B. Gibbons , Viswanath Poosala , Sridhar Ramaswamy, Join synopses for approximate query answering, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p.275-286, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
|
| |
3
|
|
 |
4
|
|
 |
5
|
Surajit Chaudhuri , Gautam Das , Vivek Narasayya, A robust, optimization-based approach for approximate answering of aggregate queries, Proceedings of the 2001 ACM SIGMOD international conference on Management of data, p.295-306, May 21-24, 2001, Santa Barbara, California, United States
|
| |
6
|
|
| |
7
|
W. Cochran. Sampling Techniques. Wiley & Sons, 1977
|
 |
8
|
|
 |
9
|
|
 |
10
|
Dimitrios Gunopulos , George Kollios , Vassilis J. Tsotras , Carlotta Domeniconi, Approximating multi-dimensional aggregate range queries over real attributes, Proceedings of the 2000 ACM SIGMOD international conference on Management of data, p.463-474, May 15-18, 2000, Dallas, Texas, United States
|
| |
11
|
C. Fan, M. Muller, I. Rezucha: Development of sampling plans by using sequential (item by item) techniques and digital computers. J. Amer. Stat. Assoc. 57: 387--402 (1962)
|
 |
12
|
Sumit Ganguly , Phillip B. Gibbons , Yossi Matias , Avi Silberschatz, Bifocal sampling for skew-resistant join size estimation, Proceedings of the 1996 ACM SIGMOD international conference on Management of data, p.271-281, June 04-06, 1996, Montreal, Quebec, Canada
|
| |
13
|
|
 |
14
|
Johannes Gehrke , Flip Korn , Divesh Srivastava, On computing correlated aggregates over continual data streams, Proceedings of the 2001 ACM SIGMOD international conference on Management of data, p.13-24, May 21-24, 2001, Santa Barbara, California, United States
|
 |
15
|
|
 |
16
|
|
| |
17
|
|
 |
18
|
Joseph M. Hellerstein , Peter J. Haas , Helen J. Wang, Online aggregation, Proceedings of the 1997 ACM SIGMOD international conference on Management of data, p.171-182, May 11-15, 1997, Tucson, Arizona, United States
|
| |
19
|
C. Jermaine: Robust Estimation With Sampling and Approximate Pre-Aggregation. VLDB 2003: 886--897
|
| |
20
|
T. Joens: A note on sampling from a tape file. Comm. ACM: 5, 343 (1964)
|
| |
21
|
N. L. Johnson and S. Kotz: Discrete Distributions. Houghton Mifflin, 1969.
|
| |
22
|
|
| |
23
|
|
 |
24
|
Frank Olken , Doron Rotem , Ping Xu, Random sampling from hash files, Proceedings of the 1990 ACM SIGMOD international conference on Management of data, p.375-386, May 23-26, 1990, Atlantic City, New Jersey, United States
|
| |
25
|
J. Shao: Mathematical Statistics. Springer-Verlag, 1999
|
| |
26
|
|
 |
27
|
|
 |
28
|
|
 |
29
|
|
CITED BY 8
|
|
|
|
|
H. Wang , S. Parthasarathy , A. Ghoting , S. Tatikonda , G. Buehrer , T. Kurc , J. Saltz, Design of a next generation sampling service for large scale data analysis applications, Proceedings of the 19th annual international conference on Supercomputing, June 20-22, 2005, Cambridge, Massachusetts
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|