| The impact of crawl policy on web search effectiveness |
| Full text |
Pdf
(1.12 MB)
|
Source
|
Annual ACM Conference on Research and Development in Information Retrieval
archive
Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
table of contents
Boston, MA, USA
SESSION: Web Retrieval II
table of contents
Pages 580-587
Year of Publication: 2009
ISBN:978-1-60558-483-6
|
|
Authors
|
|
| Sponsors |
|
| Publisher |
|
| Bibliometrics |
Downloads (6 Weeks): 50, Downloads (12 Months): 159, Citation Count: 0
|
|
|
ABSTRACT
Crawl selection policy has a direct influence on Web search effectiveness, because a useful page that is not selected for crawling will also be absent from search results. Yet there has been little or no work on measuring this effect. We introduce an evaluation framework, based on relevance judgments pooled from multiple search engines, measuring the maximum potential NDCG that is achievable using a particular crawl. This allows us to evaluate different crawl policies and investigate important scenarios like selection stability over multiple iterations. We conduct two sets of crawling experiments at the scale of 1~billion and 100~million pages respectively. These show that crawl selection based on PageRank, indegree and trans-domain indegree all allow better retrieval effectiveness than a simple breadth-first crawl of the same size. PageRank is the most reliable and effective method. Trans-domain indegree can outperform PageRank, but over multiple crawl iterations it is less effective and more unstable. Finally we experiment with combinations of crawl selection methods and per-domain page limits, which yield crawls with greater potential NDCG than PageRank.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
| |
2
|
R. Baeza-Yates and C. Castillo. Crawling the infinite web. Journal of Web Engineering, 6(1):49--72, 2007.
|
 |
3
|
|
 |
4
|
Ziv Bar-Yossef , Andrei Z. Broder , Ravi Kumar , Andrew Tomkins, Sic transit gloria telae: towards an understanding of the web's decay, Proceedings of the 13th international conference on World Wide Web, May 17-20, 2004, New York, NY, USA
[doi> 10.1145/988672.988716]
|
| |
5
|
P. Boldi, and M. Santini, and S. Vigna. Paradoxical effects in pagerank incremental computations. Internet Mathematics, 2(3):387--404, 2005.
|
| |
6
|
|
| |
7
|
|
| |
8
|
|
 |
9
|
Anirban Dasgupta , Arpita Ghosh , Ravi Kumar , Christopher Olston , Sandeep Pandey , Andrew Tomkins, The discoverability of the web, Proceedings of the 16th international conference on World Wide Web, May 08-12, 2007, Banff, Alberta, Canada
[doi> 10.1145/1242572.1242630]
|
| |
10
|
|
 |
11
|
|
| |
12
|
|
 |
13
|
|
 |
14
|
|
 |
15
|
|
 |
16
|
|
 |
17
|
|
| |
18
|
|
 |
19
|
|
|