ACM Home Page
Please provide us with feedback. Feedback
Statistical analysis of sketch estimators
Full text PdfPdf (487 KB)
Source
International Conference on Management of Data archive
Proceedings of the 2007 ACM SIGMOD international conference on Management of data table of contents
Beijing, China
SESSION: Approximate query processing table of contents
Pages: 187 - 198  
Year of Publication: 2007
ISBN:978-1-59593-686-8
Authors
Florin Rusu  University of Florida, Gainesville, FL
Alin Dobra  University of Florida, Gainesville, FL
Sponsors
ACM: Association for Computing Machinery
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 11,   Downloads (12 Months): 95,   Citation Count: 1
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1247480.1247503
What is a DOI?

ABSTRACT

Sketching techniques can provide approximate answers to aggregate queries either for data-streaming or distributed computation. Small space summaries that have linearity properties are required for both types of applications. The prevalent method for analyzing sketches uses moment analysis and distribution independent bounds based on moments. This method produces clean, easy to interpret, theoretical bounds that are especially useful for deriving asymptotic results. However, the theoretical bounds obscure fine details of the behavior of various sketches and they are mostly not indicative of which type of sketches should be used in practice. Moreover, no significant empirical comparison between various sketching techniques has been published, which makes the choice even harder. In this paper, we take a close look at the sketching techniques proposed in the literature from a statistical point of view with the goal of determining properties that indicate the actual behavior and producing tighter confidence bounds. Interestingly, the statistical analysis reveals that two of the techniques, Fast-AGMS and Count-Min, provide results that are in some cases orders of magnitude better than the corresponding theoretical predictions. We conduct an extensive empirical study that compares the different sketching techniques in order to corroborate the statistical analysis with the conclusions we draw from it. The study indicates the expected performance of various sketches, which is crucial if the techniques are to be used by practitioners. The overall conclusion of the study is that Fast-AGMS sketches are, for the full spectrum of problems, either the best, or close to the best, sketching technique. This makes Fast-AGMS sketches the preferred choice irrespective of the situation.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
N. Alon, P. B. Gibbons, Y. Matias, and M. Szegedy. Tracking join and self-join sizes in limited storage. J. Comput. Syst. Sci., 64(3):719--747, 2002.
2
 
3
K. P. Balanda and H. L. MacGillivray. Kurtosis: A critical review. J. American Statistician, 42(2):111--119, 1988.
 
4
 
5
 
6
 
7
Current Population Survey (CPS). http://www.census.gov/cps.
8
 
9
 
10
S. Ganguly, M. Garofalakis, and R. Rastogi. Processing data-stream join aggregates using skimmed sketches. In EDBT 2004, pages 569--586.
 
11
S. Ganguly, D. Kesh, and C. Saha. Practical algorithms for tracking database join sizes. In FSTTCS 2005, pages 297--309.
 
12
P. J. Haas and J. M. Hellerstein. Ripple joins for online aggregation. In SIGMOD 1999, pages 287--298.
 
13
D. Kempe, A. Dobra, and J. Gehrke. Computing aggregate information using gossip. In FOCS 2003.
 
14
MassDAL. http://www.cs.rutgers.edu/~muthu/massdal.html.
 
15
D. J. Olive. A simple confidence interval for the median. Manuscript, 2005.
 
16
F. Pennecchi and L. Callegaro. Between the mean and the median: the L<sub>p</sub> estimator. Metrologia, 43(3):213--219, 2006.
 
17
R. M. Price and D. G. Bonett. Estimating the variance of the sample median. J. Statistical Computation and Simulation, 68(3):295--305, 2001.
18
 
19
J. Shao. Mathematical Statistics. Springer-Verlag, 1999.
 
20