ABSTRACT
This article shares our experience in designing a Web crawler that can download billions of pages using a single-server implementation, and models its performance. We first show that current crawling algorithms cannot effectively cope with the sheer volume of URLs generated in large crawls, highly branching spam, legitimate multimillion-page blog sites, and infinite loops created by server-side scripts. We then offer a set of techniques for dealing with these issues and test their performance in an implementation we call IRLbot. In our recent experiment that lasted 41 days, IRLbot running on a single server successfully crawled 6.3 billion valid HTML pages (7.6 billion connection requests) and sustained an average download rate of 319 Mb/s (1,789 pages/s). Unlike our prior experiments with algorithms proposed in related work, this version of IRLbot did not experience any bottlenecks and successfully handled content from over 117 million hosts, parsed out 394 billion links, and discovered a subset of the Web graph with 41 billion unique nodes.