ACM Home Page
Please provide us with feedback. Feedback
Online maintenance of very large random samples
Full text PdfPdf (157 KB)
Source International Conference on Management of Data archive
Proceedings of the 2004 ACM SIGMOD international conference on Management of data table of contents
Paris, France
SESSION: Research sessions: statistics table of contents
Pages: 299 - 310  
Year of Publication: 2004
ISBN:1-58113-859-8
Authors
Christopher Jermaine  University of Florida, Gainesville, FL
Abhijit Pol  University of Florida, Gainesville, FL
Subramanian Arumugam  University of Florida, Gainesville, FL
Sponsor
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 4,   Downloads (12 Months): 43,   Citation Count: 8
Additional Information:

abstract   references   cited by   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1007568.1007603
What is a DOI?

ABSTRACT

Random sampling is one of the most fundamental data management tools available. However, most current research involving sampling considers the problem of how to use a sample, and not how to compute one. The implicit assumption is that a "sample" is a small data structure that is easily maintained as new data are encountered, even though simple statistical arguments demonstrate that very large samples of gigabytes or terabytes in size can be necessary to provide high accuracy. No existing work tackles the problem of maintaining very large, disk-based samples from a data management perspective, and no techniques now exist for maintaining very large samples in an online manner from streaming data. In this paper, we present online algorithms for maintaining on-disk samples that are gigabytes or terabytes in size. The algorithms are designed for streaming data, or for any environment where a large sample must be maintained online in a single pass through a data set. The algorithms meet the strict requirement that the sample always be a true, statistically random sample (without replacement) of all of the data processed thus far. Our algorithms are also suitable for biased or unequal probability sampling.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
2
 
3
4
5
 
6
 
7
W. Cochran. Sampling Techniques. Wiley & Sons, 1977
8
9
10
 
11
C. Fan, M. Muller, I. Rezucha: Development of sampling plans by using sequential (item by item) techniques and digital computers. J. Amer. Stat. Assoc. 57: 387--402 (1962)
12
 
13
14
15
16
 
17
18
 
19
C. Jermaine: Robust Estimation With Sampling and Approximate Pre-Aggregation. VLDB 2003: 886--897
 
20
T. Joens: A note on sampling from a tape file. Comm. ACM: 5, 343 (1964)
 
21
N. L. Johnson and S. Kotz: Discrete Distributions. Houghton Mifflin, 1969.
 
22
 
23
24
 
25
J. Shao: Mathematical Statistics. Springer-Verlag, 1999
 
26
27
28
29

CITED BY  8
Collaborative Colleagues:
Christopher Jermaine: colleagues
Abhijit Pol: colleagues
Subramanian Arumugam: colleagues