Confidence bounds for sampling-based group by estimates
Source
ACM Transactions on Database Systems (TODS)
Volume 33, Issue 3 (August 2008)
Article No. 16  
Year of Publication: 2008
ISSN:0362-5915
Authors
Fei Xu, University of Florida, Gainesville, FL
Christopher Jermaine, University of Florida, Gainesville, FL
Alin Dobra, University of Florida, Gainesville, FL
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1386118.1386122

ABSTRACT

Sampling is now a very important data management tool, to such an extent that an interface for database sampling is included in the latest SQL standard. In this article we reconsider in depth what at first may seem like a very simple problem: computing the error of a sampling-based guess for the answer to a GROUP BY query over a multitable join. The difficulty when sampling for the answer to such a query is that the same sample is used to guess the result of the query for each group, which induces correlations among the estimates. Thus, from a statistical point of view, it is very problematic and even dangerous to use traditional methods such as per-group confidence intervals for communicating estimate accuracy to the user. We explore ways to address this problem, and pay particular attention to the computational aspects of computing “safe” confidence intervals.
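
As a rough illustration of why per-group intervals are not "safe" when reported together, the sketch below compares ordinary intervals with Bonferroni-adjusted ones. It is not the method developed in the article: it estimates a per-group mean from a single-table sample rather than an aggregate over a multitable join, it relies on a normal approximation, and the function name `group_intervals` and the sample data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def group_intervals(groups, alpha=0.05, simultaneous=True):
    """Return {group: (low, high)} normal-approximation intervals for each group mean.

    With simultaneous=True, each group gets an error budget of alpha / k
    (Bonferroni). By the union bound, the chance that any of the k intervals
    misses is then at most about alpha, however the estimates are correlated.
    """
    k = len(groups)
    per_group_alpha = alpha / k if simultaneous else alpha
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - per_group_alpha / 2)
    intervals = {}
    for name, values in groups.items():
        center = mean(values)
        half_width = z * stdev(values) / sqrt(len(values))
        intervals[name] = (center - half_width, center + half_width)
    return intervals


# Hypothetical sampled values, already partitioned by the GROUP BY key.
sample = {
    "east": [10.0, 12.0, 11.0, 13.0],
    "west": [20.0, 19.0, 22.0, 21.0],
    "north": [5.0, 7.0, 6.0, 8.0],
}

naive = group_intervals(sample, simultaneous=False)  # each interval ~95% on its own
safe = group_intervals(sample, simultaneous=True)    # all intervals ~95% jointly
```

The adjusted intervals share the same centers but are wider, which is the price of a guarantee that covers every group at once. Bonferroni is typically conservative; it is used here only because it needs no assumption about the correlation structure.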


