ACM Home Page
Please provide us with feedback. Feedback
Space-optimal heavy hitters with strong error bounds
Full text PdfPdf (418 KB)
Source
Symposium on Principles of Database Systems archive
Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems table of contents
Providence, Rhode Island, USA
SESSION: Stream processing table of contents
Pages 157-166  
Year of Publication: 2009
ISBN:978-1-60558-553-6
Authors
Radu Berinde  MIT, Cambridge, MA, USA
Graham Cormode  AT&T Labs - Research, Florham Park, NJ, USA
Piotr Indyk  MIT, Cambridge, MA, USA
Martin J. Strauss  University of Michigan, Ann Arbor,, MI, USA
Sponsors
ACM: Association for Computing Machinery
SIGACT: ACM Special Interest Group on Algorithms and Computation Theory
SIGMOD: ACM Special Interest Group on Management of Data
SIGART: ACM Special Interest Group on Artificial Intelligence
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 18,   Downloads (12 Months): 43,   Citation Count: 0
Additional Information:

abstract   references   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1559795.1559819
What is a DOI?

ABSTRACT

The problem of finding heavy hitters and approximating the frequencies of items is at the heart of many problems in data stream analysis. It has been observed that several proposed solutions to this problem can outperform their worst-case guarantees on real data. This leads to the question of whether some stronger bounds can be guaranteed. We answer this in the positive by showing that a class of "counter-based algorithms" (including the popular and very space-efficient FREQUENT and SPACESAVING algorithms) provide much stronger approximation guarantees than previously known. Specifically, we show that errors in the approximation of individual elements do not depend on the frequencies of the most frequent elements, but only on the frequency of the remaining "tail." This shows that counter-based methods are the most space-efficient (in fact, space-optimal) algorithms having this strong error bound.

This tail guarantee allows these algorithms to solve the "sparse recovery" problem. Here, the goal is to recover a faithful representation of the vector of frequencies, f. We prove that using space O(k), the algorithms construct an approximation f* to the frequency vector f so that the L1 error ||f -- f*||1 is close to the best possible error minf2 ||f2 -- f||1, where f2 ranges over all vectors with at most k non-zero entries. This improves the previously best known space bound of about O(k log n) for streams without element deletions (where n is the size of the domain from which stream elements are drawn). Other consequences of the tail guarantees are results for skewed (Zipfian) data, and guarantees for accuracy of merging multiple summarized streams.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
A. Arasu, S. Babu, and J. Widom. Cql: A language for continuous queries over streams and relations. Proceedings of the 9th DBPL International Confenrence on Data Base and Programming Languages, pages 1--11, 2003.
 
2
R. Berinde, A. Gilbert, P. Indyk, H. Karloff, and M. Strauss. Combining geometry and combinatorics: a unified approach to sparse signal recovery. Allerton, 2008.
 
3
R. Berinde, P. Indyk, and M. Ruzic. Practical near-optimal sparse recovery in the l1 norm. Allerton, 2008.
4
 
5
 
6
P. Bose, E. Kranakis, P. Morin, and Y. Tang. Bounds for frequency estimation of packet streams. Proceedings of the 10th International Colloquium on Structural Information and Communication Complexity, pages 33--42, 2003.
 
7
E.J. Candès, J. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1208--1223, 2006.
 
8
 
9
 
10
 
11
 
12
G. Cormode and S. Muthukrishnan. Improved data stream summaries: The count-min sketch and its applications. FSTTCS, 2004.
 
13
 
14
D.L. Donoho. Compressed sensing. Unpublished manuscript, Oct. 2004.
15
 
16
17
18
19
20
 
21
P. Indyk. Sketching, streaming and sublinear-space algorithms. Graduate course notes, available at http://stellar.mit.edu/S/course/6/fa07/6.895/, 2007.
 
22
23
 
24
 
25
A. Metwally, D. Agrawal, and A. Abbabi. Efficient computation of frequent and top-k elements in data streams. International Conference on Database Theory, pages 398--412, 2005.
 
26
J. Misra and D. Gries. Finding repeated elements. Science of Computer Programming, 2:142--152, 1982.
 
27
 
28
Compressed sensing resources. Available at http://www.dsp.ece.rice.edu/cs/, 2006. Rice DSP Group.
29
 
30
G. Zipf. Human Behavior and The Principle of Least Effort. Addison-Wesley, 1949.

Collaborative Colleagues:
Radu Berinde: colleagues
Graham Cormode: colleagues
Piotr Indyk: colleagues
Martin J. Strauss: colleagues