|
ABSTRACT
We consider the problem of estimating the number of distinct values in a column of a table. For large tables without an index on the column, random sampling appears to be the only scalable approach for estimating the number of distinct values. We establish a powerful negative result stating that no estimator can guarantee small error across all input distributions, unless it examines a large fraction of the input data. In fact, any estimator must incur a significant error on at least some of a natural class of distributions. We then provide a new estimator which is provably optimal, in that its error is guaranteed to essentially match our negative result. A drawback of this estimator is that while its worst-case error is reasonable, it does not necessarily give the best possible error bound on any given distribution. Therefore, we develop heuristic estimators that are optimized for a class of typical input distributions. While these estimators lack strong guarantees on distribution-independent worst-case error, our extensive empirical comparison indicate their effectiveness both on real data sets and on synthetic data sets.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
Noga Alon , Yossi Matias , Mario Szegedy, The space complexity of approximating the frequency moments, Proceedings of the twenty-eighth annual ACM symposium on Theory of computing, p.20-29, May 22-24, 1996, Philadelphia, Pennsylvania, United States
[doi> 10.1145/237814.237823]
|
| |
2
|
|
| |
3
|
J. Bunge and M. Fitzpatrick. Estimating the Number of Species: A Review. Journal of the American Statistical Association 88(1993): 364-373.
|
| |
4
|
K.P. Burnham and W.S. Overton. Estimation of the size of a closed population when capture possibilities vary among animals. Biometrika 65(1978): 625-633.
|
| |
5
|
K.P. Burnham and W.S. Overton. Robust estimation of population size when capture possibilities vary among animals. Ecology 60(1979): 927-936.
|
| |
6
|
A. Chao. Nonparametric estimation of the number of classes in a population. Scandinavian Journal of Statistical Theory and Applications 11(1984): 265-270.
|
| |
7
|
A. Chao and S. Lee. Estimating the number of classes via sample coverage. Journal of the American Statistical Association 87(1992): 210-217.
|
 |
8
|
Surajit Chaudhuri , Rajeev Motwani , Vivek Narasayya, Random sampling for histogram construction: how much is enough?, Proceedings of the 1998 ACM SIGMOD international conference on Management of data, p.436-447, June 01-04, 1998, Seattle, Washington, United States
|
 |
9
|
Surajit Chaudhuri , Rajeev Motwani , Vivek Narasayya, On random sampling over joins, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p.263-274, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
|
| |
10
|
|
 |
11
|
|
| |
12
|
P. Flajolet and G.N. Martin. Probabilistic counting. In Proceedings of the IEEE Symposium on the Foundations of Computer Science, 1983, pp. 76-82.
|
| |
13
|
Information and Computer Science, University of California, Irvine. ftp://ftp.ics.uci.edu/pub/machinele arnin g-d at abases
|
| |
14
|
L.A. Goodman. On the estimation of the number of classes in a population. Annals of Mathematical Statistics 20(1949): 572-579.
|
| |
15
|
|
| |
16
|
P.J. Haas and L. Stokes. Estimating the number of classes in a finite population. In Journal of the American Statistical Association, 93(1998): 1475-1487.
|
| |
17
|
J.F. Heltshe and N.E. Forrester. Estimating species richness using the jackknife procedure. Biometrics 39(1983): 1-11.
|
 |
18
|
|
 |
19
|
|
| |
20
|
D.E. Knuth. Sorting and Searching, Volume 3 of The Art of Computer Programming. Addison-Wesley, Reading, MA, 1971.
|
| |
21
|
|
| |
22
|
|
| |
23
|
F. Olken. Random Sampling from Databases. PhD Thesis, Computer Science, U.C. Berkeley, 1993.
|
| |
24
|
F. Olken and D. Rotem. Random Sampling from Databases- A Survey. Manuscript, 1995.
|
| |
25
|
G. Ozsoyoglu, K. Du, a. Tjahjana, W. Hou, and D.Y. Rowland. On estimating COUNT, SUM, and AV- ERAGE relational algebra queries. In Proceedings of the Conference on Database and Ezpert Systems Applications, pages 406-412, 1991.
|
 |
26
|
Viswanath Poosala , Peter J. Haas , Yannis E. Ioannidis , Eugene J. Shekita, Improved histograms for selectivity estimation of range predicates, Proceedings of the 1996 ACM SIGMOD international conference on Management of data, p.294-305, June 04-06, 1996, Montreal, Quebec, Canada
|
| |
27
|
A. Shlosser. On estimation of the size of the dictionary of a long text on the basis of a sample. Engineering Cybernetics 19(1981): 97-102.
|
| |
28
|
|
| |
29
|
E.P. Smith and G. Belle. Nonparametric estimation of species richness. Biometrics 40(1984):119-129.
|
 |
30
|
|
| |
31
|
G.E. Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley Press, Inc, 1949.
|
CITED BY 43
|
|
|
|
|
Anna C. Gilbert , Yannis Kotidis , S. Muthukrishnan , Marin J. Strauss, Optimal and approximate computation of summary statistics for range aggregates, Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.227-236, May 2001, Santa Barbara, California, United States
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Brian Babcock , Shivnath Babu , Mayur Datar , Rajeev Motwani , Jennifer Widom, Models and issues in data stream systems, Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 03-05, 2002, Madison, Wisconsin
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Carlo Dell'aquila , Ezio Lefons , Filippo Tangorra, Analytic-based estimation of query result sizes, Proceedings of the 4th WSEAS International Conference on Artificial Intelligence, Knowledge Engineering Data Bases, p.1-7, February 13-15, 2005, Salzburg, Austria
|
|
|
|
|
|
|
|
|
Kevin Beyer , Peter J. Haas , Berthold Reinwald , Yannis Sismanis , Rainer Gemulla, On synopses for distinct-value estimation under multiset operations, Proceedings of the 2007 ACM SIGMOD international conference on Management of data, June 11-14, 2007, Beijing, China
|
|
|
|
|
|
Graham Cormode , Mayur Datar , Piotr Indyk , S. Muthukrishnan, Comparing data streams using Hamming norms (how to zero in), Proceedings of the 28th international conference on Very Large Data Bases, p.335-345, August 20-23, 2002, Hong Kong, China
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Sunil Chakkappen , Thierry Cruanes , Benoit Dageville , Linan Jiang , Uri Shaft , Hong Su , Mohamed Zait, Efficient and scalable statistics gathering for large databases in Oracle 11g, Proceedings of the 2008 ACM SIGMOD international conference on Management of data, June 09-12, 2008, Vancouver, Canada
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|