ACM Home Page
Please provide us with feedback. Feedback
Distributed hash sketches: Scalable, efficient, and accurate cardinality estimation for distributed multisets
Full text PdfPdf (2.90 MB)
Source
ACM Transactions on Computer Systems (TOCS) archive
Volume 27 ,  Issue 1  (February 2009) table of contents
Article No. 2  
Year of Publication: 2009
ISSN:0734-2071
Authors
N. Ntarmos  R.A. Computer Technology Institute and University of Patras, Rio, Patras, Greece
P. Triantafillou  R.A. Computer Technology Institute and University of Patras, Rio, Patras, Greece
G. Weikum  Max-Planck-Institut für Informatik, Saarbrücken, Germany
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): n/a,   Downloads (12 Months): n/a,   Citation Count: 0
Additional Information:

abstract   references   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1482619.1482621
What is a DOI?

ABSTRACT

Counting items in a distributed system, and estimating the cardinality of multisets in particular, is important for a large variety of applications and a fundamental building block for emerging Internet-scale information systems. Examples of such applications range from optimizing query access plans in peer-to-peer data sharing, to computing the significance (rank/score) of data items in distributed information retrieval. The general formal problem addressed in this article is computing the network-wide distinct number of items with some property (e.g., distinct files with file name containing “spiderman”) where each node in the network holds an arbitrary subset, possibly overlapping the subsets of other nodes. The key requirements that a viable approach must satisfy are: (1) scalability towards very large network size, (2) efficiency regarding messaging overhead, (3) load balance of storage and access, (4) accuracy of the cardinality estimation, and (5) simplicity and easy integration in applications. This article contributes the DHS (Distributed Hash Sketches) method for this problem setting: a distributed, scalable, efficient, and accurate multiset cardinality estimator. DHS is based on hash sketches for probabilistic counting, but distributes the bits of each counter across network nodes in a judicious manner based on principles of Distributed Hash Tables, paying careful attention to fast access and aggregation as well as update costs. The article discusses various design choices, exhibiting tunable trade-offs between estimation accuracy, hop-count efficiency, and load distribution fairness. We further contribute a full-fledged, publicly available, open-source implementation of all our methods, and a comprehensive experimental evaluation for various settings.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
2
3
 
4
 
5
 
6
 
7
Bawa, M., Garcia-Molina, H., Gionis, A., and Motwani, R. 2003. Estimating aggregates on a peer-to-peer network. Tech. rep., Computer Science Department, Stanford University.
8
9
10
11
 
12
 
13
 
14
Cormode, G. and Muthukrishnan, S. 2004. An improved data stream summary: The count-min sketch and its applications. In Proceedings of the Latin American Symposium on Theoretical Informatics (LATIN).
 
15
 
16
Damgaard, C. and Weiner, J. 2000. Describing inequality in plant size or fecundity. Ecology 81, 1139--1142.
 
17
Dobra, A., Garofalakis, M., Gehrke, J., and Rastogi, R. 2004. Sketch-Based multi-query processing over data streams. In Proceedings of the International Conference on Extending Database Technology (EDBT).
 
18
 
19
Durand, M. and Flajolet, P. 2003. Loglog counting of large cardinalities. In Proceedings of the Annual European Symposium on Algorithms (ESA).
 
20
 
21
FreeDHS. 2006. Homepage. http://netcins.ceid.upatras.gr/DHS.php.
 
22
FreePastry. 2002. Homepage. http://freepastry.org/FreePastry/.
23
24
 
25
Gnutella. 2001. Homepage. http://gnutella.wego.com/.
26
 
27
Gupta, A., Agrawal, D., and El Abbadi, A. 2003. Approximate range selection queries in peer-to-peer systems. In Proceedings of the ACM SIGMOD/VLDB Biennial Conference on Innovative Data Systems Research (CIDR).
 
28
Hadjieleftheriou, M., Byers, J. W., and Kollios, G. 2005. Robust sketching and aggregation of distributed data streams. Tech. rep. 2005-011, Computer Science Department, Boston University.
 
29
 
30
 
31
Huebsch, R., Chun, B. N., Hellerstein, J. M., Loo, B. T., Maniatis, P., Roscoe, T., Shenker, S., Stoica, I., and Yumerefendi, A. R. 2005. The architecture of PIER: An Internet-scale query processor. In Proceedings of the ACM SIGMOD/VLDB Biennial Conference on Innovative Data Systems Research (CIDR).
 
32
 
33
Ives, Z., Khandelwal, N., Kapur, A., and Cakir, M. 2005. ORCHESTRA: Rapid, collaborative sharing of dynamic data. In Proceedings of the ACM SIGMOD/VLDB Biennial Conference on Innovative Data Systems Research (CIDR).
 
34
35
 
36
 
37
Koloniari, G. and Pitoura, E. 2004. Content-based routing of path queries in peer-to-peer systems. In Proceedings of the International Conference on Extending Database Technology (EDBT).
 
38
 
39
40
41
 
42
43
 
44
Montresor, A., Meling, H., and Babaoǧlu, Ö. 2002. Messor: Load-Balancing through a swarm of autonomous agents. In Proceedings of the Workshop on Agent and Peer-to-Peer Systems.
 
45
Ng, W. S., Ooi, B. C., Tan, K. L., and Zhou, A. 2003. PeerDB: A P2P-based system for distributed data sharing. In Proceedings of the International Conference on Data Engineering (ICDE).
 
46
Ntarmos, N. and Triantafillou, P. 2004. AESOP: Altruism-Endowed self-organizing peers. In Proceedings of the International Workshop on Databases, Information Systems and Peer-to-Peer Computing (DBISP2P).
 
47
 
48
Palmer, C. R., Siganos, G., Faloutsos, M., Faloutsos, C., and Gibbons, P. B. 2001. The connectivity and fault-tolerance of the Internet topology. In Proceedings of the Workshop on Network-Related Data Management (NRDM).
 
49
Papadimos, V., Maier, D., and Tufte, K. 2003. Distributed query processing and catalogs for peer-to-peer systems. In Proceedings of the ACM SIGMOD/VLDB Biennial Conference on Innovative Data Systems Research (CIDR).
 
50
Pitoura, T., Ntarmos, N., and Triantafillou, P. 2006. Replication, load balancing, and efficient range query processing in DHT data networks. In Proceedings of the International Conference on Extending Database Technology (EDBT).
 
51
Pitoura, T. and Triantafillou, P. 2007. Load distribution fairness in p2p data management systems. In Proceedings of the International Conference on Data Engineering (ICDE).
52
 
53
 
54
Saroiu, S., Gummadi, P. K., and Gribble, S. D. 2002. A measurement study of peer-to-peer file sharing systems. In Proceedings of the Multimedia Computing and Networking Conference (MMCN).
55
 
56
Triantafillou, P. and Pitoura, T. 2003. Towards a unifying framework for complex query processing over structured peer-to-peer data networks. In Proceedings of the International Workshop on Databases, Information Systems and Peer-to-Peer Computing (DBISP2P).
57
58
 
59
 
60

Collaborative Colleagues:
N. Ntarmos: colleagues
P. Triantafillou: colleagues
G. Weikum: colleagues