ACM Home Page
Please provide us with feedback. Feedback
A bi-level Bernoulli scheme for database sampling
Full text PdfPdf (345 KB)
Source International Conference on Management of Data archive
Proceedings of the 2004 ACM SIGMOD international conference on Management of data table of contents
Paris, France
SESSION: Research sessions: statistics table of contents
Pages: 275 - 286  
Year of Publication: 2004
ISBN:1-58113-859-8
Authors
Peter J. Haas  IBM Almaden Research Center
Christian König  Berufsakademie Stuttgart/IBM Germany
Sponsor
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 5,   Downloads (12 Months): 47,   Citation Count: 8
Additional Information:

abstract   references   cited by   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1007568.1007601
What is a DOI?

ABSTRACT

Current database sampling methods give the user insufficient control when processing ISO-style sampling queries. To address this problem, we provide a bi-level Bernoulli sampling scheme that combines the row-level and page-level sampling methods currently used in most commercial systems. By adjusting the parameters of the method, the user can systematically trade off processing speed and statistical precision---the appropriate choice of parameter settings becomes a query optimization problem. We indicate the SQL extensions needed to support bi-level sampling and determine the optimal parameter settings for an important class of sampling queries with explicit time or accuracy constraints. As might be expected, row-level sampling is preferable when data values on each page are homogeneous, whereas page-level sampling should be used when data values on a page vary widely. Perhaps surprisingly, we show that in many cases the optimal sampling policy is of the "bang-bang" type: we identify a "page-heterogeneity index" (PHI) such that optimal sampling is as "row-like" as possible if the PHI is less than 1 and as "page-like" as possible otherwise. The PHI depends upon both the query and the data, and can be estimated by means of a pilot sample. Because pilot sampling can be nontrivial to implement in commercial database systems, we also give a heuristic method for setting the sampling parameters; the method avoids pilot sampling by using a small number of summary statistics that are maintained in the system catalog. Results from over 1100 experiments on 372 real and synthetic data sets show that the heuristic method performs optimally about half of the time, and yields sampling errors within a factor of 2.2 of optimal about 93% of the time. The heuristic method is stable over a wide range of sampling rates and performs best in the most critical cases, where the data is highly clustered or skewed.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
 
2
T. F. Chan, G. H. Golub, and R. J. LeVeque, Algorithms for computing the sample variance: Analysis and recommendation. Amer. Statist., 37:242--247, 1983.
 
3
4
 
5
L. Devroye. Non-Uniform Random Variate Generation. Springer-Verlag, New York, 1986.
6
7
8
 
9
S. Lahman. The Baseball Archive Baseball Database, v4.5. www.baseball1.com.
 
10
F. Olken. Random Sampling from Databases. Ph.D. Dissertation, University of California, Berkeley, CA, 1993.
 
11
 
12
C.-E. Särndal, B. Swensson, and J. Wretman. Model Assisted Survey Sampling. Springer-Verlag, New York, 1992.
 
13
R. Winter and K. Auerbach. The big time: 1998 winter VLDB survey. Database Programming and Design, 11(8), 1998.

CITED BY  8
Collaborative Colleagues:
Peter J. Haas: colleagues
Christian König: colleagues