|
ABSTRACT
This paper presents a detailed examination of how the dynamic and heterogeneous nature of real-world peer-to-peer systems can introduce bias into the selection of representative samples of peer properties (e.g., degree, link bandwidth, number of files shared). We propose the Metropolized Random Walk with Backtracking (MRWB) as a viable and promising technique for collecting nearly unbiased samples and conduct an extensive simulation study to demonstrate that our technique works well for a wide variety of commonly-encountered peer-to-peer network conditions. We have implemented the MRWB algorithm for selecting peer addresses uniformly at random into a tool called ion-sampler. Using the Gnutella network, we empirically show that ion-sampler. yields more accurate samples than tools that rely on commonly-used sampling techniques and results in dramatic improvements in efficiency and scalability compared to performing a full crawl.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
Ion Stoica , Robert Morris , David Liben-Nowell , David R. Karger , M. Frans Kaashoek , Frank Dabek , Hari Balakrishnan, Chord: a scalable peer-to-peer lookup protocol for internet applications, IEEE/ACM Transactions on Networking (TON), v.11 n.1, p.17-32, February 2003
[doi> 10.1109/TNET.2002.808407]
|
 |
2
|
Sylvia Ratnasamy , Paul Francis , Mark Handley , Richard Karp , Scott Schenker, A scalable content-addressable network, Proceedings of the 2001 conference on Applications, technologies, architectures, and protocols for computer communications, p.161-172, August 2001, San Diego, California, United States
|
| |
3
|
|
| |
4
|
R. Bhagwan, S. Savage, and G. Voelker, "Understanding availability," presented at the 2003 Int. Workshop on Peer-to-Peer Systems, Berkeley, CA.
|
 |
5
|
|
| |
6
|
S. Chib and E. Greenberg, "Understanding the Metropolis-Hastings algorithm," The Americian Statistician, vol. 49, no. 4, pp. 327-335, Nov. 1995.
|
| |
7
|
W. Hastings, "Monte carlo sampling methods using Markov chains and their applications," Biometrika, vol. 57, pp. 97-109, 1970.
|
| |
8
|
N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller, "Equations of state calculations by fast computing machines," J. Chem. Phys., vol. 21, pp. 1087-1092, 1953.
|
| |
9
|
|
 |
10
|
|
| |
11
|
D. Stutzbach, R. Rejaie, N. Duffield, S. Sen, and W. Willinger, "Sampling techniques for large, dynamic graphs," presented at the 2006 Global Internet Symp., Barcelona, Spain, Apr. 2006.
|
| |
12
|
D. Stutzbach and R. Rejaie, "Capturing accurate snapshots of the Gnutella network," in Proc. 2005 Global Internet Symp., Miami, FL, Mar. 2005, pp. 127-132.
|
| |
13
|
B. Bollobás, "A probabilistic proof of an asymptotic formula for the number of labelled regular graphs," Eur. J. Combinator., vol. 1, pp. 311-316, 1980.
|
| |
14
|
|
| |
15
|
|
| |
16
|
V. Krishnamurthy, J. Sun, M. Faloutsos, and S. Tauro, "Sampling Internet topologies: How small can we go?," in Proc. 2003 Int. Conf. Internet Computing, Las Vegas, NV, Jun. 2003, pp. 577-580.
|
| |
17
|
V. Krishnamurthy, M. Faloutsos, M. Chrobak, L. Lao, and J.-H. C. G. Percus, "Reducing large Internet topologies for faster simulations," presented at the 2005 IFIP Networking Conf., Waterloo, Ontario, Canada, May 2005.
|
| |
18
|
M. P. H. Stumpf, C. Wiuf, and R. M. May, "Subnets of scale-free networks are not scale-free: Sampling properties of networks," Proc. National Academy of Sciences, vol. 102, no. 12, pp. 4221-4224, Mar. 2005.
|
| |
19
|
|
| |
20
|
A. Lakhina, J. W. Byers, M. Crovella, and P. Xie, "Sampling biases in IP topology measurements," presented at the IEEE INFOCOM 2003, San Francisco, CA.
|
 |
21
|
Dimitris Achlioptas , Aaron Clauset , David Kempe , Cristopher Moore, On the bias of traceroute sampling: or, power-law degree distributions in regular graphs, Proceedings of the thirty-seventh annual ACM symposium on Theory of computing, May 22-24, 2005, Baltimore, MD, USA
[doi> 10.1145/1060590.1060693]
|
| |
22
|
P. Rusmevichientong, D. M. Pennock, S. Lawrence, and C. L. Giles, "Methods for sampling pages uniformly from the World Wide Web," in Proc. AAAI Fall Symp. Using Uncertainty Within Computation, 2001, pp. 121-128.
|
| |
23
|
|
| |
24
|
L. Lovász, "Random walks on graphs: A survey," Combinatorics: Paul Erdös is Eighty, vol. 2, pp. 1-46, 1993.
|
 |
25
|
Qin Lv , Pei Cao , Edith Cohen , Kai Li , Scott Shenker, Search and replication in unstructured peer-to-peer networks, Proceedings of the 16th international conference on Supercomputing, June 22-26, 2002, New York, New York, USA
[doi> 10.1145/514191.514206]
|
 |
26
|
Yatin Chawathe , Sylvia Ratnasamy , Lee Breslau , Nick Lanham , Scott Shenker, Making gnutella-like P2P systems scalable, Proceedings of the 2003 conference on Applications, technologies, architectures, and protocols for computer communications, August 25-29, 2003, Karlsruhe, Germany
[doi> 10.1145/863955.864000]
|
| |
27
|
C. Gkantsidis, M. Mihail, and A. Saberi, "Random walks in peer-to-peer networks," presented at the IEEE INFOCOM 2004, Hong Kong.
|
| |
28
|
V. Vishnumurthy and P. Francis, "On heterogeneous overlay construction and random node selection in unstructured P2P networks," presented at the IEEE INFOCOM 2006, Barcelona, Spain, Apr. 2006.
|
| |
29
|
M. Salganik and D. Heckathorn, "Sampling and estimation in hidden populations using respondent-driven sampling," Sociological Methodology , vol. 34, pp. 193-239, 2004.
|
| |
30
|
D. Heckathorn, "Respondent-driven sampling: a new approach to the study of hidden populations," Social Problems, vol. 44, no. 2, pp. 174-199, 1997.
|
| |
31
|
L. Goodman, "Snowball sampling," Ann. Math. Statist., vol. 32, pp. 148-170, 1961.
|
| |
32
|
E. Deaux and J. Callaghan, "Key informant versus self-report estimates of health behavior," Eval. Rev., vol. 9, pp. 365-368, 1985.
|
| |
33
|
J. Watters and P. Biernack, "Targeted sampling: Options for the study of hidden populations," Annu. Rev. Sociol., vol. 12, pp. 401-429, 1989.
|
| |
34
|
M. Salganik, "Variance estimation, design effects, and sample size calculation for respondent-driven sampling," J. Urban Health, vol. 83, pp. 98-112, 2006, Suppl. 1.
|
 |
35
|
Jure Leskovec , Jon Kleinberg , Christos Faloutsos, Graphs over time: densification laws, shrinking diameters and possible explanations, Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, August 21-24, 2005, Chicago, Illinois, USA
[doi> 10.1145/1081870.1081893]
|
| |
36
|
A. Bonato, "A survey of models of the web graph," Combinatorial and Algorithmic Aspects of Networking, pp. 159-172, 2004.
|
| |
37
|
J. Pouwelse, P. Garbacki, D. Epema, and H. Sips, "The BitTorrent P2P file-sharing system: Measurements and analysis," presented at the 2005 Int. Workshop on Peer-to-Peer Systems (IPTPS), Ithaca, NY.
|
| |
38
|
M. Izal, G. Urvoy-Keller, E. W. Biersack, P. A. Felber, A. A. Hamra, and L. Garces-Erice, "Dissecting BitTorrent: Five months in a torrent's lifetime," presented at the Passive and Active Measurement Conf. (PAM), Antibes Juan-les-Pins, France, Apr. 2004.
|
| |
39
|
|
 |
40
|
|
 |
41
|
|
| |
42
|
Sean Rhea , Dennis Geels , Timothy Roscoe , John Kubiatowicz, Handling churn in a DHT, Proceedings of the annual conference on USENIX Annual Technical Conference, p.10-10, June 27-July 02, 2004, Boston, MA
|
| |
43
|
J. Li, J. Stribling, F. Kaashoek, R. Morris, and T. Gil, "A performance vs. cost framework for evaluating DHT design tradeoffs under churn," presented at the IEEE INFOCOM 2005, Miami, FL.
|
| |
44
|
F. E. Bustamante and Y. Qiao, "Friendships that last: Peer lifespan and its role in P2P protocols," in Proc. Int. Workshop on Web Content Caching and Distribution, 2003, p. 2.
|
| |
45
|
|
 |
46
|
Krishna P. Gummadi , Richard J. Dunn , Stefan Saroiu , Steven D. Gribble , Henry M. Levy , John Zahorjan, Measurement, modeling, and analysis of a peer-to-peer file-sharing workload, Proceedings of the nineteenth ACM symposium on Operating systems principles, October 19-22, 2003, Bolton Landing, NY, USA
|
 |
47
|
|
| |
48
|
H. Dämpfling, Gnutella Web Caching System: Version 2 Specifications Client Developers' Guide. Jun. 2003 [Online]. Available: http://www. gnucleus.com/gwebcache/newgwc.html
|
| |
49
|
P. Karbhari, M. Ammar, A. Dhamdhere, H. Raj, G. Riley, and E. Zegura, "Bootstrapping in Gnutella: A measurement study," presented at the 2004 Passive and Active Measurement Conf. (PAM) Antibes Juan-les-Pins, France, Apr. 2004.
|
| |
50
|
R. Srinivasan, Importance Sampling--Application in Communications and Detection. Berlin, Germany: Springer-Verlag, 2002.
|
| |
51
|
D. Stutzbach and R. Rejaie, "Improving lookup performance over a widely-deployed DHT," presented at the IEEE INFOCOM 2006, Barcelona, Spain, Apr. 2006.
|
|