| Confidence bounds for sampling-based group by estimates |
| Full text |
Pdf
(1.52 MB)
|
Source
|
ACM Transactions on Database Systems (TODS)
archive
Volume 33 , Issue 3 (August 2008)
table of contents
Article No. 16
Year of Publication: 2008
ISSN:0362-5915
|
|
Authors
|
|
Fei Xu
|
University of Florida, Gainesville, Gainesville, FL
|
|
Christopher Jermaine
|
University of Florida, Gainesville, Gainesville, FL
|
|
Alin Dobra
|
University of Florida, Gainesville, Gainesville, FL
|
|
| Publisher |
|
| Bibliometrics |
Downloads (6 Weeks): 14, Downloads (12 Months): 145, Citation Count: 0
|
|
|
ABSTRACT
Sampling is now a very important data management tool, to such an extent that an interface for database sampling is included in the latest SQL standard. In this article we reconsider in depth what at first may seem like a very simple problem—computing the error of a sampling-based guess for the answer to a GROUP BY query over a multitable join. The difficulty when sampling for the answer to such a query is that the same sample will be used to guess the result of the query for each group, which induces correlations among the estimates. Thus, from a statistical point-of-view it is very problematic and even dangerous to use traditional methods such as confidence intervals for communicating estimate accuracy to the user. We explore ways to address this problem, and pay particular attention to the computational aspects of computing “safe” confidence intervals.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
Swarup Acharya , Phillip B. Gibbons , Viswanath Poosala , Sridhar Ramaswamy, Join synopses for approximate query answering, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p.275-286, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
|
 |
2
|
Swarup Acharya , Phillip B. Gibbons , Viswanath Poosala , Sridhar Ramaswamy, The Aqua approximate query answering system, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p.574-576, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
|
 |
3
|
Swarup Acharya , Phillip B. Gibbons , Viswanath Poosala, Congressional samples for approximate answering of group-by queries, Proceedings of the 2000 ACM SIGMOD international conference on Management of data, p.487-498, May 15-18, 2000, Dallas, Texas, United States
|
| |
4
|
Benjamini, Y. and Hochberg, Y. 1995. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Royal Statisti. Soc. 57, 289--300.
|
| |
5
|
Casella, G. and Berger, R. L. 2002. Statistical Inference. 2nd Ed. Duxbury. CAS g2 02:1 1.Ex.
|
 |
6
|
Moses Charikar , Surajit Chaudhuri , Rajeev Motwani , Vivek Narasayya, Towards estimation error guarantees for distinct values, Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.268-279, May 15-18, 2000, Dallas, Texas, United States
[doi> 10.1145/335168.335230]
|
 |
7
|
Surajit Chaudhuri , Gautam Das , Vivek Narasayya, A robust, optimization-based approach for approximate answering of aggregate queries, Proceedings of the 2001 ACM SIGMOD international conference on Management of data, p.295-306, May 21-24, 2001, Santa Barbara, California, United States
|
 |
8
|
|
| |
9
|
Dragici, S. 2003. Data Analysis Tools for DNA Microarrays. Chapman and Hall, CRC Press.
|
| |
10
|
|
 |
11
|
|
 |
12
|
|
| |
13
|
Joseph M. Hellerstein , Ron Avnur , Andy Chou , Christian Hidber , Chris Olston , Vijayshankar Raman , Tali Roth , Peter J. Haas, Interactive Data Analysis: The Control Project, Computer, v.32 n.8, p.51-59, August 1999
[doi> 10.1109/2.781635]
|
 |
14
|
Joseph M. Hellerstein , Peter J. Haas , Helen J. Wang, Online aggregation, Proceedings of the 1997 ACM SIGMOD international conference on Management of data, p.171-182, May 11-15, 1997, Tucson, Arizona, United States
|
| |
15
|
Hochberg, Y. 1988. A sharper bonferroni procedure for multiple tests of significance. Biometrika 75, 800--802.
|
| |
16
|
|
| |
17
|
Holm, S. 1979. A simple sequentially rejective multiple test procedure. Scand. J. Stat 6, 65--70.
|
 |
18
|
|
 |
19
|
|
| |
20
|
Hsu, J. 1996. Multiple Comparisons: Theory and Methods. Chapman and Hall, CRC Press.
|
 |
21
|
Christopher Jermaine , Alin Dobra , Subramanian Arumugam , Shantanu Joshi , Abhijit Pol, A disk-based join with probabilistic guarantees, Proceedings of the 2005 ACM SIGMOD international conference on Management of data, June 14-16, 2005, Baltimore, Maryland
[doi> 10.1145/1066157.1066222]
|
| |
22
|
Johnson, N. L., Kotz, S., and Balakrishnan, N. 1995. Continuous Univariate Distributions Vol. 2, Wiley, New York.
|
 |
23
|
Richard J. Lipton , Jeffrey F. Naughton , Donovan A. Schneider, Practical selectivity estimation through adaptive sampling, Proceedings of the 1990 ACM SIGMOD international conference on Management of data, p.1-11, May 23-26, 1990, Atlantic City, New Jersey, United States
|
| |
24
|
Miller, R. G. 1981. Simultaneous Statistical Inference, 2nd ed. Springer, Berlin, Germany.
|
| |
25
|
|
 |
26
|
Frank Olken , Doron Rotem , Ping Xu, Random sampling from hash files, Proceedings of the 1990 ACM SIGMOD international conference on Management of data, p.375-386, May 23-26, 1990, Atlantic City, New Jersey, United States
|
| |
27
|
|
| |
28
|
Sarndal, C., Swensson, B., and Wretman, J. 1992. Model Assisted Survey Sampling. Springer, Berlin, Germany.
|
| |
29
|
Storey, J. D. 2002. A direct approach to false discovery rates. J. Royal Statist. Soc. Series B 64, 479--498.
|
| |
30
|
Westfall, P. and Young, S. 1993. Resampling-Based Multiple Testing. Wiley, New York.
|
|