|
ABSTRACT
The task of estimating the number of distinct values (DVs) in a large dataset arises in a wide variety of settings in computer science and elsewhere. We provide DV estimation techniques for the case in which the dataset of interest is split into partitions. We create for each partition a synopsis that can be used to estimate the number of DVs in the partition. By combining and extending a number of results in the literature, we obtain both suitable synopses and DV estimators. The synopses can be created in parallel, and can be easily combined to yield synopses and DV estimates for "compound" partitions that are created from the base partitions via arbitrary multiset union, intersection, or difference operations. Our synopses can also handle deletions of individual partition elements. We prove that our DV estimators are unbiased, provide error bounds, and show how to select synopsis sizes in order to achieve a desired estimation accuracy. Experiments and theory indicate that our synopses and estimators lead to lower computational costs and more accurate DV estimates than previous approaches.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
Astrahan, M., Schkolnick, M., Whang, K. Approximating the number of unique values of an attribute without sorting. Inf. Sys. 12 (1987), 11--15.
|
| |
2
|
Bar-Yossef, Z., Jayram, T.S., Kumar, R., Sivakumar, D., Trevisan, L. Counting distinct elements in a data stream. In Proc. RANDOM (2002), 1--10.
|
| |
3
|
Beyer, K.S., Haas, P.J., Reinwald, B., Sismanis, Y., Gemulla, R. On synopses for distinct-value estimation under multiset operations. In Proc. ACM SIGMOD (2007), 199--210.
|
| |
4
|
Brown, P.G., Haas, P.J. Techniques for warehousing of sample data. In Proc. ICDE (2006).
|
| |
5
|
Cohen, E., Kaplan, H. Tighter estimation using bottom k sketches. Proc. VLDB Endow. 1, 1 (2008), 213--224.
|
| |
6
|
Dasu, T., Johnson, T., Muthukrishnan, S., Shkapenyuk, V. Mining database structure; or, how to build a data quality browser. In Proc. ACM SIGMOD (2002), 240--251.
|
| |
7
|
Duffield, N., Lund, C., Thorup, M. Priority sampling for estimation of arbitrary subset sums. J. ACM 54, 6 (2007), 32.
|
| |
8
|
Durand, M., Flajolet, P. Loglog counting of large cardinalities. In Proc. ESA (2003), 605--617.
|
| |
9
|
Flajolet, P., Martin, G.N. Probabilistic counting algorithms for data base applications. J. Comp. Sys. Sci. 31 (1985), 182--209.
|
| |
10
|
Ganguly, S., Garofalakis, M., Rastogi, R. Tracking set-expression cardinalities over continuous update streams. VLDB J. 13 (2004), 354--369.
|
| |
11
|
Gemulla, R. Sampling Algorithms for Evolving Datasets. Ph.D. thesis, TU Dresden, Dept. of CS, 2008. http://nbn-resolving.de/urn:nbn:de:bsz:14-ds-1224861856184-11644.
|
| |
12
|
Gemulla, R., Lehner, W. Sampling time-based sliding windows in bounded space. In Proc. SIGMOD (2008), 379--392.
|
| |
13
|
Gibbons, P. Distinct-values estimation over data streams. In M. Garofalakis, J. Gehrke, and R. Rastogi, editors, Data Stream Management: Processing High-Speed Data Streams. Springer, 2009. To appear.
|
| |
14
|
Haas, P.J., Stokes, L. Estimating the number of classes in a finite population. J. Amer. Statist. Assoc. 93 (1998), 1475--1487.
|
| |
15
|
Hadjieleftheriou, M., Yu, X., Koudas, N., Srivastava, D. Hashed samples: selectivity estimators for set similarity selection queries. Proc. VLDB Endow. 1, 1 (2008), 201--212.
|
| |
16
|
Motwani, R., Raghavan, P. Randomized Algorithms. Cambridge University Press (1995).
|
| |
17
|
Serfling, R.J. Approximation Theorems of Mathematical Statistics. Wiley, New York (1980).
|
| |
18
|
Shukla, A., Deshpande, P., Naughton, J.F., Ramasamy, K. Storage estimation for multidimensional aggregates in the presence of hierarchies. In Proc. VLDB (1996), 522--531.
|
| |
19
|
Simitsis, A., Baid, A., Sismanis, Y., Reinwald, B. Multidimensional content eXploration. Proc. VLDB Endow. 1, 1 (2008), 660--671.
|
| |
20
|
Szegedy, M. The DLT priority sampling is essentially optimal. In Proc. STOC (2006), 150--158.
|
| |
21
|
Whang, K., Vander-Zanden, B.T., Taylor, H.M. A linear-time probabilistic counting algorithm for database applications. ACM Trans. Database Sys. 15 (1990), 208--229.
|
| |
22
|
Zimmer, C., Tryfonopoulos, C., Weikum, G. Exploiting correlated keywords to improve approximate information filtering. In Proc. SIGIR (2008), 323--330.
|
|