|
ABSTRACT
Counting items in a distributed system, and estimating the cardinality of multisets in particular, is important for a large variety of applications and a fundamental building block for emerging Internet-scale information systems. Examples of such applications range from optimizing query access plans in peer-to-peer data sharing, to computing the significance (rank/score) of data items in distributed information retrieval. The general formal problem addressed in this article is computing the network-wide distinct number of items with some property (e.g., distinct files with file name containing “spiderman”) where each node in the network holds an arbitrary subset, possibly overlapping the subsets of other nodes. The key requirements that a viable approach must satisfy are: (1) scalability towards very large network size, (2) efficiency regarding messaging overhead, (3) load balance of storage and access, (4) accuracy of the cardinality estimation, and (5) simplicity and easy integration in applications. This article contributes the DHS (Distributed Hash Sketches) method for this problem setting: a distributed, scalable, efficient, and accurate multiset cardinality estimator. DHS is based on hash sketches for probabilistic counting, but distributes the bits of each counter across network nodes in a judicious manner based on principles of Distributed Hash Tables, paying careful attention to fast access and aggregation as well as update costs. The article discusses various design choices, exhibiting tunable trade-offs between estimation accuracy, hop-count efficiency, and load distribution fairness. We further contribute a full-fledged, publicly available, open-source implementation of all our methods, and a comprehensive experimental evaluation for various settings.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
 |
2
|
Noga Alon , Phillip B. Gibbons , Yossi Matias , Mario Szegedy, Tracking join and self-join sizes in limited storage, Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.10-20, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
[doi> 10.1145/303976.303978]
|
 |
3
|
Noga Alon , Yossi Matias , Mario Szegedy, The space complexity of approximating the frequency moments, Proceedings of the twenty-eighth annual ACM symposium on Theory of computing, p.20-29, May 22-24, 1996, Philadelphia, Pennsylvania, United States
[doi> 10.1145/237814.237823]
|
| |
4
|
|
| |
5
|
|
| |
6
|
|
| |
7
|
Bawa, M., Garcia-Molina, H., Gionis, A., and Motwani, R. 2003. Estimating aggregates on a peer-to-peer network. Tech. rep., Computer Science Department, Stanford University.
|
 |
8
|
|
 |
9
|
Kevin Beyer , Peter J. Haas , Berthold Reinwald , Yannis Sismanis , Rainer Gemulla, On synopses for distinct-value estimation under multiset operations, Proceedings of the 2007 ACM SIGMOD international conference on Management of data, June 11-14, 2007, Beijing, China
[doi> 10.1145/1247480.1247504]
|
 |
10
|
Ashwin R. Bharambe , Mukesh Agrawal , Srinivasan Seshan, Mercury: supporting scalable multi-attribute range queries, Proceedings of the 2004 conference on Applications, technologies, architectures, and protocols for computer communications, August 30-September 03, 2004, Portland, Oregon, USA
|
 |
11
|
Surajit Chaudhuri , Rajeev Motwani , Vivek Narasayya, Random sampling for histogram construction: how much is enough?, Proceedings of the 1998 ACM SIGMOD international conference on Management of data, p.436-447, June 01-04, 1998, Seattle, Washington, United States
|
| |
12
|
|
| |
13
|
|
| |
14
|
Cormode, G. and Muthukrishnan, S. 2004. An improved data stream summary: The count-min sketch and its applications. In Proceedings of the Latin American Symposium on Theoretical Informatics (LATIN).
|
| |
15
|
Frank Dabek , Jinyang Li , Emil Sit , James Robertson , M. Frans Kaashoek , Robert Morris, Designing a DHT for low latency and high throughput, Proceedings of the 1st conference on Symposium on Networked Systems Design and Implementation, p.7-7, March 29-31, 2004, San Francisco, California
|
| |
16
|
Damgaard, C. and Weiner, J. 2000. Describing inequality in plant size or fecundity. Ecology 81, 1139--1142.
|
| |
17
|
Dobra, A., Garofalakis, M., Gehrke, J., and Rastogi, R. 2004. Sketch-Based multi-query processing over data streams. In Proceedings of the International Conference on Extending Database Technology (EDBT).
|
| |
18
|
|
| |
19
|
Durand, M. and Flajolet, P. 2003. Loglog counting of large cardinalities. In Proceedings of the Annual European Symposium on Algorithms (ESA).
|
| |
20
|
|
| |
21
|
FreeDHS. 2006. Homepage. http://netcins.ceid.upatras.gr/DHS.php.
|
| |
22
|
FreePastry. 2002. Homepage. http://freepastry.org/FreePastry/.
|
 |
23
|
|
 |
24
|
Sumit Ganguly , Phillip B. Gibbons , Yossi Matias , Avi Silberschatz, Bifocal sampling for skew-resistant join size estimation, Proceedings of the 1996 ACM SIGMOD international conference on Management of data, p.271-281, June 04-06, 1996, Montreal, Quebec, Canada
|
| |
25
|
Gnutella. 2001. Homepage. http://gnutella.wego.com/.
|
 |
26
|
K. Gummadi , R. Gummadi , S. Gribble , S. Ratnasamy , S. Shenker , I. Stoica, The impact of DHT routing geometry on resilience and proximity, Proceedings of the 2003 conference on Applications, technologies, architectures, and protocols for computer communications, August 25-29, 2003, Karlsruhe, Germany
[doi> 10.1145/863955.863998]
|
| |
27
|
Gupta, A., Agrawal, D., and El Abbadi, A. 2003. Approximate range selection queries in peer-to-peer systems. In Proceedings of the ACM SIGMOD/VLDB Biennial Conference on Innovative Data Systems Research (CIDR).
|
| |
28
|
Hadjieleftheriou, M., Byers, J. W., and Kollios, G. 2005. Robust sketching and aggregation of distributed data streams. Tech. rep. 2005-011, Computer Science Department, Boston University.
|
| |
29
|
Matthew Harren , Joseph M. Hellerstein , Ryan Huebsch , Boon Thau Loo , Scott Shenker , Ion Stoica, Complex Queries in DHT-based Peer-to-Peer Networks, Revised Papers from the First International Workshop on Peer-to-Peer Systems, p.242-259, March 07-08, 2002
|
| |
30
|
Nicholas J. A. Harvey , Michael B. Jones , Stefan Saroiu , Marvin Theimer , Alec Wolman, SkipNet: a scalable overlay network with practical locality properties, Proceedings of the 4th conference on USENIX Symposium on Internet Technologies and Systems, p.9-9, March 26-28, 2003, Seattle, WA
|
| |
31
|
Huebsch, R., Chun, B. N., Hellerstein, J. M., Loo, B. T., Maniatis, P., Roscoe, T., Shenker, S., Stoica, I., and Yumerefendi, A. R. 2005. The architecture of PIER: An Internet-scale query processor. In Proceedings of the ACM SIGMOD/VLDB Biennial Conference on Innovative Data Systems Research (CIDR).
|
| |
32
|
Ryan Huebsch , Joseph M. Hellerstein , Nick Lanham , Boon Thau Loo , Scott Shenker , Ion Stoica, Querying the internet with PIER, Proceedings of the 29th international conference on Very large data bases, p.321-332, September 09-12, 2003, Berlin, Germany
|
| |
33
|
Ives, Z., Khandelwal, N., Kapur, A., and Cakir, M. 2005. ORCHESTRA: Rapid, collaborative sharing of dynamic data. In Proceedings of the ACM SIGMOD/VLDB Biennial Conference on Innovative Data Systems Research (CIDR).
|
| |
34
|
|
 |
35
|
David Karger , Eric Lehman , Tom Leighton , Rina Panigrahy , Matthew Levine , Daniel Lewin, Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web, Proceedings of the twenty-ninth annual ACM symposium on Theory of computing, p.654-663, May 04-06, 1997, El Paso, Texas, United States
[doi> 10.1145/258533.258660]
|
| |
36
|
|
| |
37
|
Koloniari, G. and Pitoura, E. 2004. Content-based routing of path queries in peer-to-peer systems. In Proceedings of the International Conference on Extending Database Technology (EDBT).
|
| |
38
|
|
| |
39
|
|
 |
40
|
Richard J. Lipton , Jeffrey F. Naughton , Donovan A. Schneider, Practical selectivity estimation through adaptive sampling, Proceedings of the 1990 ACM SIGMOD international conference on Management of data, p.1-11, May 23-26, 1990, Atlantic City, New Jersey, United States
|
 |
41
|
|
| |
42
|
|
 |
43
|
Sebastian Michel , Matthias Bender , Nikos Ntarmos , Peter Triantafillou , Gerhard Weikum , Christian Zimmer, Discovering and exploiting keyword and attribute-value co-occurrences to improve P2P routing indices, Proceedings of the 15th ACM international conference on Information and knowledge management, November 06-11, 2006, Arlington, Virginia, USA
[doi> 10.1145/1183614.1183643]
|
| |
44
|
Montresor, A., Meling, H., and Babaoǧlu, Ö. 2002. Messor: Load-Balancing through a swarm of autonomous agents. In Proceedings of the Workshop on Agent and Peer-to-Peer Systems.
|
| |
45
|
Ng, W. S., Ooi, B. C., Tan, K. L., and Zhou, A. 2003. PeerDB: A P2P-based system for distributed data sharing. In Proceedings of the International Conference on Data Engineering (ICDE).
|
| |
46
|
Ntarmos, N. and Triantafillou, P. 2004. AESOP: Altruism-Endowed self-organizing peers. In Proceedings of the International Workshop on Databases, Information Systems and Peer-to-Peer Computing (DBISP2P).
|
| |
47
|
|
| |
48
|
Palmer, C. R., Siganos, G., Faloutsos, M., Faloutsos, C., and Gibbons, P. B. 2001. The connectivity and fault-tolerance of the Internet topology. In Proceedings of the Workshop on Network-Related Data Management (NRDM).
|
| |
49
|
Papadimos, V., Maier, D., and Tufte, K. 2003. Distributed query processing and catalogs for peer-to-peer systems. In Proceedings of the ACM SIGMOD/VLDB Biennial Conference on Innovative Data Systems Research (CIDR).
|
| |
50
|
Pitoura, T., Ntarmos, N., and Triantafillou, P. 2006. Replication, load balancing, and efficient range query processing in DHT data networks. In Proceedings of the International Conference on Extending Database Technology (EDBT).
|
| |
51
|
Pitoura, T. and Triantafillou, P. 2007. Load distribution fairness in p2p data management systems. In Proceedings of the International Conference on Data Engineering (ICDE).
|
 |
52
|
Sylvia Ratnasamy , Paul Francis , Mark Handley , Richard Karp , Scott Schenker, A scalable content-addressable network, Proceedings of the 2001 conference on Applications, technologies, architectures, and protocols for computer communications, p.161-172, August 2001, San Diego, California, United States
|
| |
53
|
Sean Rhea , Dennis Geels , Timothy Roscoe , John Kubiatowicz, Handling churn in a DHT, Proceedings of the annual conference on USENIX Annual Technical Conference, p.10-10, June 27-July 02, 2004, Boston, MA
|
| |
54
|
Saroiu, S., Gummadi, P. K., and Gribble, S. D. 2002. A measurement study of peer-to-peer file sharing systems. In Proceedings of the Multimedia Computing and Networking Conference (MMCN).
|
 |
55
|
Ion Stoica , Robert Morris , David Karger , M. Frans Kaashoek , Hari Balakrishnan, Chord: A scalable peer-to-peer lookup service for internet applications, Proceedings of the 2001 conference on Applications, technologies, architectures, and protocols for computer communications, p.149-160, August 2001, San Diego, California, United States
|
| |
56
|
Triantafillou, P. and Pitoura, T. 2003. Towards a unifying framework for complex query processing over structured peer-to-peer data networks. In Proceedings of the International Workshop on Databases, Information Systems and Peer-to-Peer Computing (DBISP2P).
|
 |
57
|
|
 |
58
|
Praveen Yalagandula , Mike Dahlin, A scalable distributed information management system, Proceedings of the 2004 conference on Applications, technologies, architectures, and protocols for computer communications, August 30-September 03, 2004, Portland, Oregon, USA
|
| |
59
|
|
| |
60
|
|
|