IRLbot: Scaling to 6 billion pages and beyond
Source
ACM Transactions on the Web (TWEB) archive
Volume 3, Issue 3 (June 2009)
Article No. 8
Year of Publication: 2009
ISSN: 1559-1131
Authors
Hsin-Tsang Lee, Texas A&M University, College Station, TX
Derek Leonard, Texas A&M University, College Station, TX
Xiaoming Wang, Texas A&M University, College Station, TX
Dmitri Loguinov, Texas A&M University, College Station, TX
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1541822.1541823

ABSTRACT

This article shares our experience in designing a Web crawler that can download billions of pages using a single-server implementation and models its performance. We first show that current crawling algorithms cannot effectively cope with the sheer volume of URLs generated in large crawls, highly branching spam, legitimate multimillion-page blog sites, and infinite loops created by server-side scripts. We then offer a set of techniques for dealing with these issues and test their performance in an implementation we call IRLbot. In our recent experiment that lasted 41 days, IRLbot running on a single server successfully crawled 6.3 billion valid HTML pages (7.6 billion connection requests) and sustained an average download rate of 319 Mb/s (1,789 pages/s). Unlike our prior experiments with algorithms proposed in related work, this version of IRLbot did not experience any bottlenecks and successfully handled content from over 117 million hosts, parsed out 394 billion links, and discovered a subset of the Web graph with 41 billion unique nodes.
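
The throughput figures in the abstract can be cross-checked with a little arithmetic. The sketch below is illustrative only and is not taken from the article: the function names are hypothetical, and it assumes that the download rate is in megabits per second (10^6 bits) and that the rate and the page rate describe the same traffic.

```python
SECONDS_PER_DAY = 86_400


def pages_per_second(total_pages: float, days: float) -> float:
    """Average page rate implied by a page total and a crawl duration in days."""
    return total_pages / (days * SECONDS_PER_DAY)


def bytes_per_page(rate_megabits_per_s: float, pages_per_s: float) -> float:
    """Average bytes per page implied by a bit rate and a page rate.

    Assumes 1 megabit = 10**6 bits (an assumption, not stated in the abstract).
    """
    return rate_megabits_per_s * 1_000_000 / 8 / pages_per_s


if __name__ == "__main__":
    # Figures quoted in the abstract (rounded there, so results are approximate).
    rate = pages_per_second(6.3e9, 41)
    size = bytes_per_page(319, 1_789)
    print(f"implied page rate: {rate:.0f} pages/s")
    print(f"implied average page size: {size / 1000:.1f} KB")
```

With the rounded totals, the implied page rate comes out slightly below the reported 1,789 pages/s, presumably because the page count and duration are rounded in the abstract. The implied average page size is on the order of 22 KB under the megabit assumption.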


