ABSTRACT
In this paper we study how we can design an effective parallel crawler. As the size of the Web grows, it becomes imperative to parallelize a crawling process, in order to finish downloading pages in a reasonable amount of time. We first propose multiple architectures for a parallel crawler and identify fundamental issues related to parallel crawling. Based on this understanding, we then propose metrics to evaluate a parallel crawler, and compare the proposed architectures using 40 million pages collected from the Web. Our results clarify the relative merits of each architecture and provide a good guideline on when to adopt which architecture.