Parallel crawlers
Source: International World Wide Web Conference
Proceedings of the 11th International Conference on World Wide Web
Honolulu, Hawaii, USA
Session: Crawling
Pages: 124-135
Year of Publication: 2002
ISBN: 1-58113-449-5
Authors
Junghoo Cho  University of California, Los Angeles
Hector Garcia-Molina  Stanford University, Stanford CA
Sponsors
ACM: Association for Computing Machinery
: WWW'02
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/511446.511464

ABSTRACT

In this paper we study how we can design an effective parallel crawler. As the size of the Web grows, it becomes imperative to parallelize a crawling process, in order to finish downloading pages in a reasonable amount of time. We first propose multiple architectures for a parallel crawler and identify fundamental issues related to parallel crawling. Based on this understanding, we then propose metrics to evaluate a parallel crawler, and compare the proposed architectures using 40 million pages collected from the Web. Our results clarify the relative merits of each architecture and provide a good guideline on when to adopt which architecture.


