Crawl ordering by search impact
Source
Web Search and Web Data Mining
Proceedings of the international conference on Web search and web data mining
Palo Alto, California, USA
SESSION: Crawling
Pages 3-14  
Year of Publication: 2008
ISBN:978-1-59593-927-9
Authors
Sandeep Pandey  Carnegie Mellon University
Christopher Olston  Yahoo! Research
Sponsors
ACM: Association for Computing Machinery
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 14,   Downloads (12 Months): 199,   Citation Count: 1
DOI: http://doi.acm.org/10.1145/1341531.1341535

ABSTRACT

We study how to prioritize the fetching of new pages under the objective of maximizing the quality of search results. In particular, our objective is to fetch new pages that have the most impact, where the impact of a page is equal to the number of times the page appears in the top K search results for queries, for some constant K, e.g., K = 10. Since the impact of a page depends on its relevance score for queries, which in turn depends on the page content, the main difficulty lies in estimating the impact of the page before actually fetching it. Hence, impact must be estimated based on the limited information that is available prior to fetching page content, e.g., the URL string, number of in-links, referring anchortext.
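
The definition of impact above can be illustrated with a small sketch. This is not the authors' implementation: the function name `page_impact`, the input layout, and the choice to weight each query by how often it is issued are all assumptions made for illustration. The sketch also assumes relevance scores are already known, which is exactly what is not true before a page is fetched.

```python
from collections import Counter


def page_impact(scores, query_freq, k=10):
    """Count how often each page appears in the top-k results.

    scores:     hypothetical input, query -> {page: relevance score}
    query_freq: hypothetical input, query -> number of times it is issued
                (weighting by frequency is an assumption, not from the abstract)
    """
    counts = Counter()
    for query, page_scores in scores.items():
        # Rank pages by descending score; break ties by page id for determinism.
        ranked = sorted(page_scores, key=lambda p: (-page_scores[p], p))
        for page in ranked[:k]:
            # Queries missing from query_freq count once.
            counts[page] += query_freq.get(query, 1)
    return counts


# Toy workload: two queries, three pages, K = 2.
scores = {
    "q1": {"a": 0.9, "b": 0.5, "c": 0.1},
    "q2": {"b": 0.8, "c": 0.7, "a": 0.2},
}
query_freq = {"q1": 3, "q2": 1}
result = page_impact(scores, query_freq, k=2)
```

In this toy workload, page `b` reaches the top 2 for both queries, so it accumulates the highest impact even though page `a` has the single highest score.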

We formally characterize this problem and study its hardness. We leverage our formalism to design a new impact-driven crawling policy, and demonstrate its effectiveness using real-world data. Our technique ensures that the crawler acquires content relevant to "tail topics" that are obscure but of interest to some users, rather than just redundantly accumulating content on popular topics.



