ABSTRACT
This article compares several page ordering strategies for Web crawling under several metrics. The objective of these strategies is to download the most "important" pages "early" during the crawl. Because the coverage of modern search engines is small compared to the size of the Web, and indexing the whole Web is impossible for both theoretical and practical reasons, it is relevant to index at least the most important pages. We use data from actual Web pages to build Web graphs and execute a crawler simulator on those graphs. As the Web is very dynamic, crawling simulation is the only way to ensure that all the strategies considered are compared under the same conditions. We propose several page ordering strategies that are more efficient than breadth-first search and strategies based on partial Pagerank calculations.
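The simulation setup described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the article's simulator: the toy graph, the importance scores, the function names, and the choice of "most known in-links first" as a second strategy to contrast with breadth-first ordering. It only shows how two orderings can be replayed on the same fixed graph and compared by how early they accumulate importance.

```python
# Hypothetical toy Web graph: page -> pages it links to.
TOY_GRAPH = {"a": ["b", "c", "d"], "b": ["d"], "c": [], "d": ["e"], "e": []}

# Hypothetical importance scores (for example, computed offline on the full graph).
TOY_IMPORTANCE = {"a": 1, "b": 1, "c": 1, "d": 3, "e": 2}


def breadth_first(frontier, known_inlinks):
    """Pick the page that was discovered first."""
    return frontier[0]


def most_known_inlinks(frontier, known_inlinks):
    """Pick the page with the most in-links seen so far (ties: earliest discovered)."""
    return max(frontier, key=lambda page: known_inlinks[page])


def simulate_crawl(graph, seeds, pick_next):
    """Replay a crawl on a fixed graph and return the download order."""
    frontier = list(seeds)               # discovered, not yet downloaded
    seen = set(seeds)
    known_inlinks = {page: 0 for page in seeds}
    order = []
    while frontier:
        page = pick_next(frontier, known_inlinks)
        frontier.remove(page)
        order.append(page)
        for target in graph.get(page, []):
            known_inlinks[target] = known_inlinks.get(target, 0) + 1
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return order


def cumulative_importance(order, importance):
    """Fraction of total importance downloaded after each page in the order."""
    total = sum(importance.values())
    collected, curve = 0, []
    for page in order:
        collected += importance[page]
        curve.append(collected / total)
    return curve


if __name__ == "__main__":
    for strategy in (breadth_first, most_known_inlinks):
        order = simulate_crawl(TOY_GRAPH, ["a"], strategy)
        print(strategy.__name__, order, cumulative_importance(order, TOY_IMPORTANCE))
```

Because both strategies run against the same static graph, any difference between the two curves comes from the ordering alone, which is the point of comparing strategies by simulation instead of by live crawls.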