| Space-optimal heavy hitters with strong error bounds |
| Full text |
Pdf
(418 KB)
|
Source
|
Symposium on Principles of Database Systems
archive
Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
table of contents
Providence, Rhode Island, USA
SESSION: Stream processing
table of contents
Pages 157-166
Year of Publication: 2009
ISBN:978-1-60558-553-6
|
|
Authors
|
|
Radu Berinde
|
MIT, Cambridge, MA, USA
|
|
Graham Cormode
|
AT&T Labs - Research, Florham Park, NJ, USA
|
|
Piotr Indyk
|
MIT, Cambridge, MA, USA
|
|
Martin J. Strauss
|
University of Michigan, Ann Arbor,, MI, USA
|
|
| Sponsors |
|
| Publisher |
|
| Bibliometrics |
Downloads (6 Weeks): 18, Downloads (12 Months): 43, Citation Count: 0
|
|
|
ABSTRACT
The problem of finding heavy hitters and approximating the frequencies of items is at the heart of many problems in data stream analysis. It has been observed that several proposed solutions to this problem can outperform their worst-case guarantees on real data. This leads to the question of whether some stronger bounds can be guaranteed. We answer this in the positive by showing that a class of "counter-based algorithms" (including the popular and very space-efficient FREQUENT and SPACESAVING algorithms) provide much stronger approximation guarantees than previously known. Specifically, we show that errors in the approximation of individual elements do not depend on the frequencies of the most frequent elements, but only on the frequency of the remaining "tail." This shows that counter-based methods are the most space-efficient (in fact, space-optimal) algorithms having this strong error bound. This tail guarantee allows these algorithms to solve the "sparse recovery" problem. Here, the goal is to recover a faithful representation of the vector of frequencies, f. We prove that using space O(k), the algorithms construct an approximation f* to the frequency vector f so that the L1 error ||f -- f*||1 is close to the best possible error minf2 ||f2 -- f||1, where f2 ranges over all vectors with at most k non-zero entries. This improves the previously best known space bound of about O(k log n) for streams without element deletions (where n is the size of the domain from which stream elements are drawn). Other consequences of the tail guarantees are results for skewed (Zipfian) data, and guarantees for accuracy of merging multiple summarized streams.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
A. Arasu, S. Babu, and J. Widom. Cql: A language for continuous queries over streams and relations. Proceedings of the 9th DBPL International Confenrence on Data Base and Programming Languages, pages 1--11, 2003.
|
| |
2
|
R. Berinde, A. Gilbert, P. Indyk, H. Karloff, and M. Strauss. Combining geometry and combinatorics: a unified approach to sparse signal recovery. Allerton, 2008.
|
| |
3
|
R. Berinde, P. Indyk, and M. Ruzic. Practical near-optimal sparse recovery in the l1 norm. Allerton, 2008.
|
 |
4
|
|
| |
5
|
|
| |
6
|
P. Bose, E. Kranakis, P. Morin, and Y. Tang. Bounds for frequency estimation of packet streams. Proceedings of the 10th International Colloquium on Structural Information and Communication Complexity, pages 33--42, 2003.
|
| |
7
|
E.J. Candès, J. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1208--1223, 2006.
|
| |
8
|
|
| |
9
|
|
| |
10
|
|
| |
11
|
Graham Cormode , Flip Korn , S. Muthukrishnan , Divesh Srivastava, Finding hierarchical heavy hitters in data streams, Proceedings of the 29th international conference on Very large data bases, p.464-475, September 09-12, 2003, Berlin, Germany
|
| |
12
|
G. Cormode and S. Muthukrishnan. Improved data stream summaries: The count-min sketch and its applications. FSTTCS, 2004.
|
| |
13
|
|
| |
14
|
D.L. Donoho. Compressed sensing. Unpublished manuscript, Oct. 2004.
|
 |
15
|
|
| |
16
|
|
 |
17
|
A. C. Gilbert , M. J. Strauss , J. A. Tropp , R. Vershynin, One sketch for all: fast algorithms for compressed sensing, Proceedings of the thirty-ninth annual ACM symposium on Theory of computing, June 11-13, 2007, San Diego, California, USA
[doi> 10.1145/1250790.1250824]
|
 |
18
|
Jiawei Han , Jian Pei , Guozhu Dong , Ke Wang, Efficient computation of Iceberg cubes with complex measures, Proceedings of the 2001 ACM SIGMOD international conference on Management of data, p.1-12, May 21-24, 2001, Santa Barbara, California, United States
|
 |
19
|
|
 |
20
|
|
| |
21
|
P. Indyk. Sketching, streaming and sublinear-space algorithms. Graduate course notes, available at http://stellar.mit.edu/S/course/6/fa07/6.895/, 2007.
|
| |
22
|
|
 |
23
|
|
| |
24
|
|
| |
25
|
A. Metwally, D. Agrawal, and A. Abbabi. Efficient computation of frequent and top-k elements in data streams. International Conference on Database Theory, pages 398--412, 2005.
|
| |
26
|
J. Misra and D. Gries. Finding repeated elements. Science of Computer Programming, 2:142--152, 1982.
|
| |
27
|
|
| |
28
|
Compressed sensing resources. Available at http://www.dsp.ece.rice.edu/cs/, 2006. Rice DSP Group.
|
 |
29
|
Nisheeth Shrivastava , Chiranjeeb Buragohain , Divyakant Agrawal , Subhash Suri, Medians and beyond: new aggregation techniques for sensor networks, Proceedings of the 2nd international conference on Embedded networked sensor systems, November 03-05, 2004, Baltimore, MD, USA
[doi> 10.1145/1031495.1031524]
|
| |
30
|
G. Zipf. Human Behavior and The Principle of Least Effort. Addison-Wesley, 1949.
|
|