ACM Home Page
Please provide us with feedback. Feedback
Towards estimation error guarantees for distinct values
Full text PdfPdf (229 KB)
Source Symposium on Principles of Database Systems archive
Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems table of contents
Dallas, Texas, United States
Pages: 268 - 279  
Year of Publication: 2000
ISBN:1-58113-214-X
Authors
Moses Charikar  Department of Computer Science, Gates 4B, Stanford University, Stanford, CA
Surajit Chaudhuri  Microsoft Research, One Microsoft Way, Redmond, WA
Rajeev Motwani  Department of Computer Science, Gates 4B, Stanford University, Stanford, CA
Vivek Narasayya  Microsoft Research, One Microsoft Way, Redmond, WA
Sponsor
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 5,   Downloads (12 Months): 61,   Citation Count: 43
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/335168.335230
What is a DOI?

ABSTRACT

We consider the problem of estimating the number of distinct values in a column of a table. For large tables without an index on the column, random sampling appears to be the only scalable approach for estimating the number of distinct values. We establish a powerful negative result stating that no estimator can guarantee small error across all input distributions, unless it examines a large fraction of the input data. In fact, any estimator must incur a significant error on at least some of a natural class of distributions. We then provide a new estimator which is provably optimal, in that its error is guaranteed to essentially match our negative result. A drawback of this estimator is that while its worst-case error is reasonable, it does not necessarily give the best possible error bound on any given distribution. Therefore, we develop heuristic estimators that are optimized for a class of typical input distributions. While these estimators lack strong guarantees on distribution-independent worst-case error, our extensive empirical comparison indicate their effectiveness both on real data sets and on synthetic data sets.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
 
2
 
3
J. Bunge and M. Fitzpatrick. Estimating the Number of Species: A Review. Journal of the American Statistical Association 88(1993): 364-373.
 
4
K.P. Burnham and W.S. Overton. Estimation of the size of a closed population when capture possibilities vary among animals. Biometrika 65(1978): 625-633.
 
5
K.P. Burnham and W.S. Overton. Robust estimation of population size when capture possibilities vary among animals. Ecology 60(1979): 927-936.
 
6
A. Chao. Nonparametric estimation of the number of classes in a population. Scandinavian Journal of Statistical Theory and Applications 11(1984): 265-270.
 
7
A. Chao and S. Lee. Estimating the number of classes via sample coverage. Journal of the American Statistical Association 87(1992): 210-217.
8
9
 
10
11
 
12
P. Flajolet and G.N. Martin. Probabilistic counting. In Proceedings of the IEEE Symposium on the Foundations of Computer Science, 1983, pp. 76-82.
 
13
Information and Computer Science, University of California, Irvine. ftp://ftp.ics.uci.edu/pub/machinele arnin g-d at abases
 
14
L.A. Goodman. On the estimation of the number of classes in a population. Annals of Mathematical Statistics 20(1949): 572-579.
 
15
 
16
P.J. Haas and L. Stokes. Estimating the number of classes in a finite population. In Journal of the American Statistical Association, 93(1998): 1475-1487.
 
17
J.F. Heltshe and N.E. Forrester. Estimating species richness using the jackknife procedure. Biometrics 39(1983): 1-11.
18
19
 
20
D.E. Knuth. Sorting and Searching, Volume 3 of The Art of Computer Programming. Addison-Wesley, Reading, MA, 1971.
 
21
 
22
 
23
F. Olken. Random Sampling from Databases. PhD Thesis, Computer Science, U.C. Berkeley, 1993.
 
24
F. Olken and D. Rotem. Random Sampling from Databases- A Survey. Manuscript, 1995.
 
25
G. Ozsoyoglu, K. Du, a. Tjahjana, W. Hou, and D.Y. Rowland. On estimating COUNT, SUM, and AV- ERAGE relational algebra queries. In Proceedings of the Conference on Database and Ezpert Systems Applications, pages 406-412, 1991.
26
 
27
A. Shlosser. On estimation of the size of the dictionary of a long text on the basis of a sample. Engineering Cybernetics 19(1981): 97-102.
 
28
 
29
E.P. Smith and G. Belle. Nonparametric estimation of species richness. Biometrics 40(1984):119-129.
30
 
31
G.E. Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley Press, Inc, 1949.

CITED BY  43

Collaborative Colleagues:
Moses Charikar: colleagues
Surajit Chaudhuri: colleagues
Rajeev Motwani: colleagues
Vivek Narasayya: colleagues