|
ABSTRACT
The task of estimating the number of distinct values (DVs) in a large dataset arises in a wide variety of settings in computer science and elsewhere. We provide DV estimation techniques that are designed for use within a flexible and scalable "synopsis warehouse" architecture. In this setting, incoming data is split into partitions and a synopsis is created for each partition; each synopsis can then be used to quickly estimate the number of DVs in its corresponding partition. By combining and extending a number of results in the literature, we obtain both appropriate synopses and novel DV estimators to use in conjunction with these synopses. Our synopses can be created in parallel, and can then be easily combined to yield synopses and DV estimates for arbitrary unions, intersections or differences of partitions. Our synopses can also handle deletions of individual partition elements. We use the theory of order statistics to show that our DV estimators are unbiased, and to establish moment formulas and sharp error bounds. Based on a novel limit theorem, we can exploit results due to Cohen in order to select synopsis sizes when initially designing the warehouse. Experiments and theory indicate that our synopses and estimators lead to lower computational costs and more accurate DV estimates than previous approaches.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
|
| |
3
|
|
| |
4
|
P. Brown, P. J. Haas, J. Myllymaki, H. Pirahesh, B. Reinwald, and Y. Sismanis. Toward automated large-scale information integration and discovery. In Data Management in a Connected World, pages 161--180. Springer, 2005.
|
| |
5
|
P. G. Brown and P. J. Haas. Techniques for warehousing of sample data. In Proc. ICDE, 2006.
|
 |
6
|
Moses Charikar , Surajit Chaudhuri , Rajeev Motwani , Vivek Narasayya, Towards estimation error guarantees for distinct values, Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.268-279, May 15-18, 2000, Dallas, Texas, United States
[doi> 10.1145/335168.335230]
|
| |
7
|
|
 |
8
|
Tamraparni Dasu , Theodore Johnson , S. Muthukrishnan , Vladislav Shkapenyuk, Mining database structure; or, how to build a data quality browser, Proceedings of the 2002 ACM SIGMOD international conference on Management of data, June 03-06, 2002, Madison, Wisconsin
[doi> 10.1145/564691.564719]
|
| |
9
|
H. A. David and H. N. Nagaraja. Order Statistics. Wiley, third edition, 2003.
|
 |
10
|
|
| |
11
|
M. Durand and P. Flajolet. Loglog counting of large cardinalities. In Proc. 11th Eur. Symp. Algorithms (ESA 2003), volume 2832 of Lecture Notes in Computer Science. Springer, 2003.
|
| |
12
|
C. Estan, G. Varghese, and M. Fisk. Bitmap algorithms for counting active flows on high speed links. In Proc. SIGCOMM '02, pages 323--336, 2002.
|
| |
13
|
P. Flajolet. Adaptive sampling In M. Hazewinkel, editor, Encyclopaedia of Mathematics, Supplement I. Kluwer, 1997.
|
| |
14
|
|
| |
15
|
|
| |
16
|
|
 |
17
|
|
| |
18
|
F. Giroire.Order statistics and estimating cardinalities of massive data sets. In Proc. Intl. Conf. Analysis Algorithms, pages 157--166, 2005.
|
| |
19
|
P. J. Haas, Y. Liu, and L. Stokes. An estimator of the number of species from quadrat sampling. Biometrics, 62:135--141, 2006.
|
| |
20
|
P. J. Haas and L. Stokes. Estimating the number of classes in a finite population. J. Amer. Statist. Assoc., 93:1475--1487, 1998.
|
 |
21
|
|
| |
22
|
Y. E. Ioannidis. The history of histograms (abridged). In Proc. VLDB, pages 19--30, 2003.
|
| |
23
|
N. L. Johnson, S. Kotz, and N. Balakrishnan. Continuous Univeriate Distributions-2. Wiley, 2nd edition, 1995.
|
| |
24
|
S. Karlin and H. M. Taylor. A Second Course in Stochastic Processes. Academic Press, 1981.
|
| |
25
|
D. E. Knuth. Sorting and Searching, volume 3 of The Art of Computer Programming. Addison-Wesley, 1973.
|
 |
26
|
|
| |
27
|
R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995.
|
 |
28
|
Sriram Padmanabhan , Bishwaranjan Bhattacharjee , Tim Malkemus , Leslie Cranston , Matthew Huras, Multi-dimensional clustering: a new data layout scheme in DB2, Proceedings of the 2003 ACM SIGMOD international conference on Management of data, June 09-12, 2003, San Diego, California
[doi> 10.1145/872757.872835]
|
 |
29
|
P. Griffiths Selinger , M. M. Astrahan , D. D. Chamberlin , R. A. Lorie , T. G. Price, Access path selection in a relational database management system, Proceedings of the 1979 ACM SIGMOD international conference on Management of data, May 30-June 01, 1979, Boston, Massachusetts
[doi> 10.1145/582095.582099]
|
| |
30
|
R. J. Serfling. Approximation Theorems of Mathematical Statistics. Wiley, New York, 1980.
|
| |
31
|
|
 |
32
|
|
 |
33
|
|
CITED BY 9
|
|
|
|
|
Sunil Chakkappen , Thierry Cruanes , Benoit Dageville , Linan Jiang , Uri Shaft , Hong Su , Mohamed Zait, Efficient and scalable statistics gathering for large databases in Oracle 11g, Proceedings of the 2008 ACM SIGMOD international conference on Management of data, June 09-12, 2008, Vancouver, Canada
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|