Crawling a country: better strategies than breadth-first for web page ordering
Source International World Wide Web Conference archive
Special interest tracks and posters of the 14th international conference on World Wide Web
Chiba, Japan
SESSION: Industrial and practical experience track paper session 2
Pages: 864-872
Year of Publication: 2005
ISBN: 1-59593-051-5
Authors
Ricardo Baeza-Yates  Universidad de Chile
Carlos Castillo  Universidad de Chile
Mauricio Marin  Universidad de Magallanes
Andrea Rodriguez  Universidad de Concepcion
Sponsor
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1062745.1062768

ABSTRACT

This article compares several page ordering strategies for Web crawling under several metrics. The objective of these strategies is to download the most "important" pages "early" during the crawl. As the coverage of modern search engines is small compared to the size of the Web, and it is impossible to index all of the Web for both theoretical and practical reasons, it is relevant to index at least the most important pages.

We use data from actual Web pages to build Web graphs and execute a crawler simulator on those graphs. As the Web is very dynamic, crawling simulation is the only way to ensure that all the strategies considered are compared under the same conditions. We propose several page ordering strategies that are more efficient than both breadth-first search and strategies based on partial Pagerank calculations.
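To make the idea of comparing ordering strategies on a fixed graph concrete, the following is a minimal sketch of a crawl simulation. It is not the authors' simulator. The toy graph, the function names, the backlink-count strategy used as the alternative to breadth-first, and the use of in-degree as a stand-in for page "importance" are all assumptions chosen for illustration.

```python
from collections import defaultdict

def simulate_crawl(graph, seeds, priority):
    """Download one page at a time, always taking the frontier page with
    the highest priority; ties go to the page discovered first."""
    discovered = {}                   # page -> discovery index
    seen_inlinks = defaultdict(int)   # in-links seen from downloaded pages
    for seed in seeds:
        discovered.setdefault(seed, len(discovered))
    downloaded, order = set(), []
    while True:
        frontier = [p for p in discovered if p not in downloaded]
        if not frontier:
            break
        page = max(frontier,
                   key=lambda p: (priority(p, seen_inlinks), -discovered[p]))
        downloaded.add(page)
        order.append(page)
        for target in graph.get(page, []):
            seen_inlinks[target] += 1
            discovered.setdefault(target, len(discovered))
    return order

def breadth_first(page, seen_inlinks):
    # Constant priority: discovery order alone decides, i.e. a FIFO queue.
    return 0

def backlink_count(page, seen_inlinks):
    # Hypothetical alternative: prefer pages with more in-links seen so far.
    return seen_inlinks[page]

def cumulative_importance(order, importance):
    """Fraction of total importance collected after each download."""
    total = sum(importance.values())
    collected, curve = 0.0, []
    for page in order:
        collected += importance[page]
        curve.append(collected / total)
    return curve

# Toy Web graph (assumption): page -> list of pages it links to.
graph = {
    "home": ["a", "b", "c"],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["d"],
    "hub": ["d"],
    "d": [],
}

# In-degree on the full graph as a crude proxy for importance (assumption).
importance = {page: 0 for page in graph}
for targets in graph.values():
    for target in targets:
        importance[target] += 1

for name, strategy in [("breadth-first", breadth_first),
                       ("backlink-count", backlink_count)]:
    order = simulate_crawl(graph, ["home"], strategy)
    curve = cumulative_importance(order, importance)
    print(name, order, [round(x, 2) for x in curve])
```

Because both strategies run on the same static graph from the same seed, the resulting curves are directly comparable, which is the property the abstract attributes to simulation. A strategy is better under this particular metric when its curve rises earlier.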


