ACM Home Page
Please provide us with feedback. Feedback
On synopses for distinct-value estimation under multiset operations
Full text PdfPdf (309 KB)
Source
International Conference on Management of Data archive
Proceedings of the 2007 ACM SIGMOD international conference on Management of data table of contents
Beijing, China
SESSION: Approximate query processing table of contents
Pages: 199 - 210  
Year of Publication: 2007
ISBN:978-1-59593-686-8
Authors
Kevin Beyer  IBM Almaden Research Center, San Jose, CA
Peter J. Haas  IBM Almaden Research Center, San Jose, CA
Berthold Reinwald  IBM Almaden Research Center, San Jose, CA
Yannis Sismanis  IBM Almaden Research Center, San Jose, CA
Rainer Gemulla  Technische Universität Dresden, Dresden, Germany
Sponsors
ACM: Association for Computing Machinery
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 22,   Downloads (12 Months): 186,   Citation Count: 9
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1247480.1247504
What is a DOI?

ABSTRACT

The task of estimating the number of distinct values (DVs) in a large dataset arises in a wide variety of settings in computer science and elsewhere. We provide DV estimation techniques that are designed for use within a flexible and scalable "synopsis warehouse" architecture. In this setting, incoming data is split into partitions and a synopsis is created for each partition; each synopsis can then be used to quickly estimate the number of DVs in its corresponding partition. By combining and extending a number of results in the literature, we obtain both appropriate synopses and novel DV estimators to use in conjunction with these synopses. Our synopses can be created in parallel, and can then be easily combined to yield synopses and DV estimates for arbitrary unions, intersections or differences of partitions. Our synopses can also handle deletions of individual partition elements. We use the theory of order statistics to show that our DV estimators are unbiased, and to establish moment formulas and sharp error bounds. Based on a novel limit theorem, we can exploit results due to Cohen in order to select synopsis sizes when initially designing the warehouse. Experiments and theory indicate that our synopses and estimators lead to lower computational costs and more accurate DV estimates than previous approaches.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
 
2
 
3
 
4
P. Brown, P. J. Haas, J. Myllymaki, H. Pirahesh, B. Reinwald, and Y. Sismanis. Toward automated large-scale information integration and discovery. In Data Management in a Connected World, pages 161--180. Springer, 2005.
 
5
P. G. Brown and P. J. Haas. Techniques for warehousing of sample data. In Proc. ICDE, 2006.
6
 
7
8
 
9
H. A. David and H. N. Nagaraja. Order Statistics. Wiley, third edition, 2003.
10
 
11
M. Durand and P. Flajolet. Loglog counting of large cardinalities. In Proc. 11th Eur. Symp. Algorithms (ESA 2003), volume 2832 of Lecture Notes in Computer Science. Springer, 2003.
 
12
C. Estan, G. Varghese, and M. Fisk. Bitmap algorithms for counting active flows on high speed links. In Proc. SIGCOMM '02, pages 323--336, 2002.
 
13
P. Flajolet. Adaptive sampling In M. Hazewinkel, editor, Encyclopaedia of Mathematics, Supplement I. Kluwer, 1997.
 
14
 
15
 
16
17
 
18
F. Giroire.Order statistics and estimating cardinalities of massive data sets. In Proc. Intl. Conf. Analysis Algorithms, pages 157--166, 2005.
 
19
P. J. Haas, Y. Liu, and L. Stokes. An estimator of the number of species from quadrat sampling. Biometrics, 62:135--141, 2006.
 
20
P. J. Haas and L. Stokes. Estimating the number of classes in a finite population. J. Amer. Statist. Assoc., 93:1475--1487, 1998.
21
 
22
Y. E. Ioannidis. The history of histograms (abridged). In Proc. VLDB, pages 19--30, 2003.
 
23
N. L. Johnson, S. Kotz, and N. Balakrishnan. Continuous Univeriate Distributions-2. Wiley, 2nd edition, 1995.
 
24
S. Karlin and H. M. Taylor. A Second Course in Stochastic Processes. Academic Press, 1981.
 
25
D. E. Knuth. Sorting and Searching, volume 3 of The Art of Computer Programming. Addison-Wesley, 1973.
26
 
27
R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995.
28
29
 
30
R. J. Serfling. Approximation Theorems of Mathematical Statistics. Wiley, New York, 1980.
 
31
32
33

CITED BY  9

Collaborative Colleagues:
Kevin Beyer: colleagues
Peter J. Haas: colleagues
Berthold Reinwald: colleagues
Yannis Sismanis: colleagues
Rainer Gemulla: colleagues