Summarizing data using bottom-k sketches
Source
Annual ACM Symposium on Principles of Distributed Computing
Proceedings of the twenty-sixth annual ACM symposium on Principles of distributed computing
Portland, Oregon, USA
SESSION: Radio networks and aggregation
Pages: 225-234
Year of Publication: 2007
ISBN: 978-1-59593-616-5
Authors
Edith Cohen, AT&T Labs-Research
Haim Kaplan, Tel Aviv University
Sponsors
SIGOPS: ACM Special Interest Group on Operating Systems
ACM: Association for Computing Machinery
SIGACT: ACM Special Interest Group on Algorithms and Computation Theory
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1281100.1281133

ABSTRACT

A bottom-k sketch is a summary of a set of items with nonnegative weights that supports approximate query processing. A sketch is obtained by associating with each item in a ground set an independent random rank, drawn from a probability distribution that depends on the weight of the item, and including the k items with the smallest rank values.
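
The construction described above can be illustrated with a minimal sketch. The abstract does not fix the rank distribution, so the example assumes exponentially distributed ranks with rate equal to the item weight, which is one common choice; the function name `bottom_k_sketch` and the `seed` parameter are hypothetical.

```python
import heapq
import math
import random


def bottom_k_sketch(weights, k, seed=0):
    """Return up to k (rank, item) pairs with the smallest ranks, sorted by rank.

    Assumption: ranks are exponential with rate equal to the item weight,
    so heavier items tend to receive smaller ranks.
    """
    rng = random.Random(seed)
    ranked = []
    for item, w in weights.items():
        if w <= 0:
            continue  # a zero-weight item would have infinite rank; skip it
        u = 1.0 - rng.random()  # uniform in (0, 1], avoids log(0)
        ranked.append((-math.log(u) / w, item))
    return heapq.nsmallest(k, ranked)


weights = {"a": 5.0, "b": 1.0, "c": 0.5, "d": 3.0, "e": 2.0}
for rank, item in bottom_k_sketch(weights, k=3):
    print(f"{item}: rank {rank:.4f}")
```

A single rank assignment is drawn and the k smallest ranks are kept, in contrast to a k-mins sketch, which would repeat the assignment k times and keep one minimum from each.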

Bottom-k sketches are an alternative to k-mins sketches [9], which consist of the k minimum ranked items in k independent rank assignments, and to min-hash sketches [5], where hash functions replace random rank assignments. Sketches support approximate aggregations, including weight and selectivity of a subpopulation. Coordinated sketches of multiple subsets over the same ground set support subset-relation queries such as Jaccard similarity or the weight of the union. All-distances sketches are applicable to datasets where items lie in some metric space, such as data streams (time) or networks. They compactly encode the respective plain sketches of all neighborhoods of a location, and they support queries posed over time windows or neighborhoods as well as time-decaying and spatially decaying aggregates.
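
As a rough illustration of coordinated sketches, the example below estimates the Jaccard similarity of two unweighted sets. It assumes coordination is achieved by deriving each item's rank from a shared hash function, so the same item receives the same rank in every subset; the names `shared_rank`, `coordinated_sketch`, and `jaccard_estimate` and the `salt` parameter are hypothetical, and the estimator shown is a standard one rather than necessarily the one analyzed in the paper.

```python
import hashlib


def shared_rank(item, salt="demo"):
    """Map an item to a pseudo-random rank in [0, 1) that is identical across sets."""
    digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def coordinated_sketch(items, k):
    """Bottom-k sketch of an unweighted set under the shared rank assignment."""
    return sorted((shared_rank(x), x) for x in set(items))[:k]


def jaccard_estimate(sketch_a, sketch_b, k):
    """Estimate |A ∩ B| / |A ∪ B| from two coordinated bottom-k sketches."""
    # The k smallest entries of the merged sketches form the bottom-k
    # sketch of the union A ∪ B.
    union_sketch = sorted(set(sketch_a) | set(sketch_b))[:k]
    if not union_sketch:
        return 0.0
    in_both = set(sketch_a) & set(sketch_b)
    hits = sum(1 for entry in union_sketch if entry in in_both)
    return hits / len(union_sketch)


a = set(range(0, 600))
b = set(range(300, 900))
k = 64
estimate = jaccard_estimate(coordinated_sketch(a, k), coordinated_sketch(b, k), k)
print(f"estimated Jaccard: {estimate:.3f} (exact value is 1/3)")
```

Any item among the k smallest ranks of the union that belongs to both sets must appear in both individual sketches, which is why counting the shared entries is enough.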

An important advantage of bottom-k sketches, established in a line of recent work, is much tighter estimators for several basic aggregates. To materialize this benefit, we must adapt traditional k-mins applications to use bottom-k sketches. We propose all-distances bottom-k sketches and develop and analyze data structures that incrementally construct bottom-k sketches and all-distances bottom-k sketches.
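
To make the idea of incremental construction concrete, here is a minimal sketch that maintains a bottom-k sketch over a stream using a bounded max-heap. This is a generic textbook approach and is not claimed to be the data structure developed in the paper; the class name `StreamingBottomK` is hypothetical, ranks are assumed to be supplied by the caller, and each item is assumed to arrive once.

```python
import heapq


class StreamingBottomK:
    """Keep the k smallest-ranked items seen so far in a stream."""

    def __init__(self, k):
        if k < 1:
            raise ValueError("k must be at least 1")
        self.k = k
        # heapq is a min-heap, so ranks are negated: the root then holds
        # the largest rank currently in the sketch.
        self._heap = []

    def offer(self, rank, item):
        """Process one (rank, item) pair; return True if the sketch changed."""
        entry = (-rank, item)
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, entry)
            return True
        if rank < -self._heap[0][0]:
            heapq.heapreplace(self._heap, entry)  # evict the largest rank
            return True
        return False

    def sketch(self):
        """Return the current sketch as (rank, item) pairs sorted by rank."""
        return sorted((-neg_rank, item) for neg_rank, item in self._heap)


stream = StreamingBottomK(k=3)
for rank, item in [(0.9, "a"), (0.1, "b"), (0.5, "c"), (0.3, "d"), (0.7, "e")]:
    stream.offer(rank, item)
print(stream.sketch())
```

Each arrival costs O(log k) in the worst case, and arrivals whose rank exceeds the current threshold are rejected after a single comparison.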

Another advantage of bottom-k sketches is that when the data is represented explicitly, they can be obtained much more efficiently than k-mins sketches. We show that k-mins sketches can be derived from respective bottom-k sketches, which enables the use of bottom-k sketches with off-the-shelf k-mins estimators. (In fact, we obtain tighter estimators, since each bottom-k sketch is a distribution over k-mins sketches.)


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

[3] A. Broder. Filtering near-duplicate documents. In FUN, 1998.
[7] B. Chazelle and L. Guibas. Fractional cascading: I. A data structuring technique. Algorithmica, 1(2):133-162, 1986.
[8] Y.-J. Chiang and R. Tamassia. Dynamic algorithms in computational geometry. Proceedings of the IEEE, 80(9):1412-1434, 1992.
[12] E. Cohen and H. Kaplan. Sketches and estimators for subpopulation weight queries. Manuscript, 2007.
[15] E. Cohen, Y.-M. Wang, and G. Suri. When piecewise determinism is almost true. In Proc. Pacific Rim International Symposium on Fault-Tolerant Systems, pages 66-71, December 1995.

