ACM Home Page
Please provide us with feedback. Feedback
The impact of crawl policy on web search effectiveness
Full text PdfPdf (1.12 MB)
Source
Annual ACM Conference on Research and Development in Information Retrieval archive
Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval table of contents
Boston, MA, USA
SESSION: Web Retrieval II table of contents
Pages 580-587  
Year of Publication: 2009
ISBN:978-1-60558-483-6
Authors
Dennis Fetterly  Microsoft Research, Mountain View, CA, USA
Nick Craswell  Microsoft Research, Cambridge, United Kingdom
Vishwa Vinay  Microsoft Research, Cambridge, United Kingdom
Sponsors
SIGIR: ACM Special Interest Group on Information Retrieval
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 50,   Downloads (12 Months): 159,   Citation Count: 0
Additional Information:

abstract   references   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1571941.1572041
What is a DOI?

ABSTRACT

Crawl selection policy has a direct influence on Web search effectiveness, because a useful page that is not selected for crawling will also be absent from search results. Yet there has been little or no work on measuring this effect. We introduce an evaluation framework, based on relevance judgments pooled from multiple search engines, measuring the maximum potential NDCG that is achievable using a particular crawl. This allows us to evaluate different crawl policies and investigate important scenarios like selection stability over multiple iterations. We conduct two sets of crawling experiments at the scale of 1~billion and 100~million pages respectively. These show that crawl selection based on PageRank, indegree and trans-domain indegree all allow better retrieval effectiveness than a simple breadth-first crawl of the same size. PageRank is the most reliable and effective method. Trans-domain indegree can outperform PageRank, but over multiple crawl iterations it is less effective and more unstable. Finally we experiment with combinations of crawl selection methods and per-domain page limits, which yield crawls with greater potential NDCG than PageRank.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
 
2
R. Baeza-Yates and C. Castillo. Crawling the infinite web. Journal of Web Engineering, 6(1):49--72, 2007.
3
4
 
5
P. Boldi, and M. Santini, and S. Vigna. Paradoxical effects in pagerank incremental computations. Internet Mathematics, 2(3):387--404, 2005.
 
6
 
7
 
8
9
 
10
11
 
12
13
14
15
16
17
 
18
19

Collaborative Colleagues:
Dennis Fetterly: colleagues
Nick Craswell: colleagues
Vishwa Vinay: colleagues