ACM Home Page
Please provide us with feedback. Feedback
Distinct-value synopses for multiset operations
Full text Digital EditionDigital Edition HtmlHtml (7 KB),  PdfPdf (857 KB)
Source
Communications of the ACM archive
Volume 52 ,  Issue 10  (October 2009) table of contents
A View of Parallel Computing
SECTION: Research highlights table of contents
Pages 87-95  
Year of Publication: 2009
ISSN:0001-0782
Authors
Kevin Beyer  IBM Almaden Research Center, San Jose, CA.
Rainer Gemulla  IBM Almaden Research Center, San Jose, CA.
Peter J. Haas  IBM Almaden Research Center, San Jose, CA.
Berthold Reinwald  IBM Almaden Research Center, San Jose, CA.
Yannis Sismanis  IBM Almaden Research Center, San Jose, CA.
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 106,   Downloads (12 Months): 106,   Citation Count: 0
Additional Information:

abstract   references   index terms  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1562764.1562787
What is a DOI?

ABSTRACT

The task of estimating the number of distinct values (DVs) in a large dataset arises in a wide variety of settings in computer science and elsewhere. We provide DV estimation techniques for the case in which the dataset of interest is split into partitions. We create for each partition a synopsis that can be used to estimate the number of DVs in the partition. By combining and extending a number of results in the literature, we obtain both suitable synopses and DV estimators. The synopses can be created in parallel, and can be easily combined to yield synopses and DV estimates for "compound" partitions that are created from the base partitions via arbitrary multiset union, intersection, or difference operations. Our synopses can also handle deletions of individual partition elements. We prove that our DV estimators are unbiased, provide error bounds, and show how to select synopsis sizes in order to achieve a desired estimation accuracy. Experiments and theory indicate that our synopses and estimators lead to lower computational costs and more accurate DV estimates than previous approaches.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
Astrahan, M., Schkolnick, M., Whang, K. Approximating the number of unique values of an attribute without sorting. Inf. Sys. 12 (1987), 11--15.
 
2
Bar-Yossef, Z., Jayram, T.S., Kumar, R., Sivakumar, D., Trevisan, L. Counting distinct elements in a data stream. In Proc. RANDOM (2002), 1--10.
 
3
Beyer, K.S., Haas, P.J., Reinwald, B., Sismanis, Y., Gemulla, R. On synopses for distinct-value estimation under multiset operations. In Proc. ACM SIGMOD (2007), 199--210.
 
4
Brown, P.G., Haas, P.J. Techniques for warehousing of sample data. In Proc. ICDE (2006).
 
5
Cohen, E., Kaplan, H. Tighter estimation using bottom k sketches. Proc. VLDB Endow. 1, 1 (2008), 213--224.
 
6
Dasu, T., Johnson, T., Muthukrishnan, S., Shkapenyuk, V. Mining database structure; or, how to build a data quality browser. In Proc. ACM SIGMOD (2002), 240--251.
 
7
Duffield, N., Lund, C., Thorup, M. Priority sampling for estimation of arbitrary subset sums. J. ACM 54, 6 (2007), 32.
 
8
Durand, M., Flajolet, P. Loglog counting of large cardinalities. In Proc. ESA (2003), 605--617.
 
9
Flajolet, P., Martin, G.N. Probabilistic counting algorithms for data base applications. J. Comp. Sys. Sci. 31 (1985), 182--209.
 
10
Ganguly, S., Garofalakis, M., Rastogi, R. Tracking set-expression cardinalities over continuous update streams. VLDB J. 13 (2004), 354--369.
 
11
Gemulla, R. Sampling Algorithms for Evolving Datasets. Ph.D. thesis, TU Dresden, Dept. of CS, 2008. http://nbn-resolving.de/urn:nbn:de:bsz:14-ds-1224861856184-11644.
 
12
Gemulla, R., Lehner, W. Sampling time-based sliding windows in bounded space. In Proc. SIGMOD (2008), 379--392.
 
13
Gibbons, P. Distinct-values estimation over data streams. In M. Garofalakis, J. Gehrke, and R. Rastogi, editors, Data Stream Management: Processing High-Speed Data Streams. Springer, 2009. To appear.
 
14
Haas, P.J., Stokes, L. Estimating the number of classes in a finite population. J. Amer. Statist. Assoc. 93 (1998), 1475--1487.
 
15
Hadjieleftheriou, M., Yu, X., Koudas, N., Srivastava, D. Hashed samples: selectivity estimators for set similarity selection queries. Proc. VLDB Endow. 1, 1 (2008), 201--212.
 
16
Motwani, R., Raghavan, P. Randomized Algorithms. Cambridge University Press (1995).
 
17
Serfling, R.J. Approximation Theorems of Mathematical Statistics. Wiley, New York (1980).
 
18
Shukla, A., Deshpande, P., Naughton, J.F., Ramasamy, K. Storage estimation for multidimensional aggregates in the presence of hierarchies. In Proc. VLDB (1996), 522--531.
 
19
Simitsis, A., Baid, A., Sismanis, Y., Reinwald, B. Multidimensional content eXploration. Proc. VLDB Endow. 1, 1 (2008), 660--671.
 
20
Szegedy, M. The DLT priority sampling is essentially optimal. In Proc. STOC (2006), 150--158.
 
21
Whang, K., Vander-Zanden, B.T., Taylor, H.M. A linear-time probabilistic counting algorithm for database applications. ACM Trans. Database Sys. 15 (1990), 208--229.
 
22
Zimmer, C., Tryfonopoulos, C., Weikum, G. Exploiting correlated keywords to improve approximate information filtering. In Proc. SIGIR (2008), 323--330.