|
ABSTRACT
In this article we present an experimental study of the properties of webgraphs. We study a large crawl from 2001 of 200M pages and about 1.4 billion edges, made available by the WebBase project at Stanford, as well as several synthetic ones generated according to various models proposed recently. We investigate several topological properties of such graphs, including the number of bipartite cores and strongly connected components, the distribution of degrees and PageRank values and some correlations; we present a comparison study of the models against these measures.Our findings are that (i) the WebBase sample differs slightly from the (older) samples studied in the literature, and (ii) despite the fact that these models do not catch all of its properties, they do exhibit some peculiar behaviors not found, for example, in the models from classical random graph theory.Moreover we developed a software library able to generate and measure massive graphs in secondary memory; this library is publicy available under the GPL licence. We discuss its implementation and some computational issues related to secondary memory graph algorithms.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
|
| |
3
|
Bollobas, B. and Riordan, O. 2003. Robustness and vulnerability of scale-free random graphs. Internet Math. 1, 1, 1--35.
|
| |
4
|
Barabasi, A. and Albert, A. 1999. Emergence of scaling in random networks. Science 286, 509.
|
| |
5
|
Boldi, P., Codenotti, B., Santini, M., and Vigna, S. 2002. Ubicrawler: A scalable fully distributed web crawler.
|
 |
6
|
|
| |
7
|
|
| |
8
|
Andrei Broder , Ravi Kumar , Farzin Maghoul , Prabhakar Raghavan , Sridhar Rajagopalan , Raymie Stata , Andrew Tomkins , Janet Wiener, Graph structure in the Web, Proceedings of the 9th international World Wide Web conference on Computer networks : the international journal of computer and telecommunications netowrking, p.309-320, June 2000, Amsterdam, The Netherlands
|
 |
9
|
|
 |
10
|
Junghoo Cho , Hector Garcia-Molina , Taher Haveliwala , Wang Lam , Andreas Paepcke , Sriram Raghavan , Gary Wesley, Stanford WebBase components and applications, ACM Transactions on Internet Technology (TOIT), v.6 n.2, p.153-186, May 2006
[doi> 10.1145/1149121.1149124]
|
| |
11
|
|
| |
12
|
cyvellance. www.cyvellance.com. Cyvellance.
|
| |
13
|
Diestel, R. 1997. Graph Theory. Springer, New York.
|
| |
14
|
Stephen Dill , Ravi Kumar , Kevin S. McCurley , Sridhar Rajagopalan , D. Sivakumar , Andrew Tomkins, Self-similarity in the Web, Proceedings of the 27th International Conference on Very Large Data Bases, p.69-78, September 11-14, 2001
|
| |
15
|
Donato, D., Laura, L., Leonardi, S., and Millozzi, S. 2004a. Large scale properties of the Webgraph. Europ. J. Phys. B 38, 2, 239--243. DOI: 10.1140/epjb/e2004-00056-6.
|
| |
16
|
|
| |
17
|
Donato, D., Laura, L., Leonardi, S., and Millozzi, S. 2004c. A software library for generating and measuring massive Webgraphs. Tech. Rep. D13, COSIN European Research Project. http://www.dis.uniroma1.it/~cosin/html_pages/COSIN-Tools.htm.
|
| |
18
|
Erdös, P. and Rényi, A. 1960. On the evoluation of random graphs Publ. Math. Inst. Hung. Acad. Sci 5.
|
| |
19
|
Gleich, D., Zuchov, L., and Berkhin, P. 2004. Fast Parallel PageRank: A Linear System Approach. Tech. Rep. 038, Yahoo! Research.
|
 |
20
|
|
| |
21
|
Harary, F. 1969. Graph Theory. Addison-Wesley, Reading, MA.
|
| |
22
|
Haveliwala, T. H. 1999. Efficient computation of PageRank. Tech. rep., Stanford University.
|
 |
23
|
|
| |
24
|
Kleinberg, J., Kumar, R., Raghavan, P., Rajagopalan, S., and Tomkins, A. 1999. The Web as a graph: measurements, models and methods. In Proceedings of the International Conference on Combinatorics and Computing. 1--18.
|
| |
25
|
|
| |
26
|
Kraft, R., Hastor, E., and Stata, R. 2003. Timelinks: Exploring the link structure of the evolving Web. In Second Workshop on Algorithms and Models for the Web-Graph (WAW2003).
|
| |
27
|
R. Kumar , P. Raghavan , S. Rajagopalan , D. Sivakumar , A. Tomkins , E. Upfal, Stochastic models for the Web graph, Proceedings of the 41st Annual Symposium on Foundations of Computer Science, p.57, November 12-14, 2000
|
| |
28
|
|
| |
29
|
Laura, L., Leonardi, S., Caldarelli, G., and De Los Rios, P. 2002. A multi-layer model for the Webgraph. In On-line proceedings of the 2nd International Workshop on Web Dynamics.
|
| |
30
|
Mitzenmacher, M. 2003. A brief history of generative models for power law and lognormal distributions. Internet Math. 1, 2.
|
| |
31
|
|
| |
32
|
Pennock, D., Flake, G., Lawrence, S., Glover, E., and Giles, C. 2002. Winners don't take all: Characterizing the competition for links on the web. Proceedings of the National Academy of Sciences 99, 8 (April), 5207--5211.
|
 |
33
|
|
| |
34
|
Tarjan, R. E. 1972. Depth-first search and linear graph algorithms. SIAM J. Comput. 1, 2, 146--160.
|
| |
35
|
Vitter, J. and Shriver, E. 1994a. Algorithms for parallel memory i: Two-level memories. Algorithmica 12, 2-3, 107--114.
|
| |
36
|
Vitter, J. and Shriver, E. 1994b. Algorithms for parallel memory ii: Hierarchical multilevel memories. Algorithmica 12, 2-3, 148--169.
|
 |
37
|
|
| |
38
|
webbase. The Stanford Webbase project. http://www-diglib.stanford.edu/~testbed/doc2/WebBase/.
|
| |
39
|
|
|