ABSTRACT
We study how to prioritize the fetching of new pages under the objective of maximizing the quality of search results. In particular, our objective is to fetch the new pages that have the most impact, where the impact of a page is the number of times the page appears in the top K search results for queries, for some constant K, e.g., K = 10. Since the impact of a page depends on its relevance score for queries, which in turn depends on the page content, the main difficulty lies in estimating the impact of a page before actually fetching it. Hence, impact must be estimated from the limited information that is available prior to fetching page content, e.g., the URL string, the number of in-links, and the referring anchor text. We formally characterize this problem and study its hardness. We leverage our formalism to design a new impact-driven crawling policy, and demonstrate its effectiveness using real-world data. Our technique ensures that the crawler acquires content relevant to "tail topics" that are obscure but of interest to some users, rather than just redundantly accumulating content on popular topics.
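The definition of impact can be made concrete with a minimal sketch. It assumes that relevance scores are already known for every page, which is exactly what a crawler does not have before fetching, so this shows only the quantity being estimated and not the estimation policy itself. The function name `page_impact`, the `query_scores` layout, and the toy data are hypothetical and do not come from the paper.

```python
from collections import Counter
import heapq


def page_impact(query_scores, k=10):
    """Count, for each page, how many queries rank it in their top k.

    query_scores maps a query string to a dict of {page_id: relevance_score}.
    This layout is an assumption made for illustration only.
    """
    impact = Counter()
    for scores in query_scores.values():
        # Pick the k highest-scoring pages for this query.
        top = heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
        for page, _score in top:
            impact[page] += 1
    return impact


# Toy workload (hypothetical scores), using K = 2 to keep it small.
query_scores = {
    "jaguar": {"a": 0.9, "b": 0.5, "c": 0.1},
    "python": {"b": 0.8, "c": 0.7, "a": 0.2},
    "rare topic": {"d": 0.4, "a": 0.3},
}
impact = page_impact(query_scores, k=2)
```

In this toy workload, page "d" has nonzero impact only because of the single tail query, which mirrors the abstract's point that pages on obscure topics can still matter to some users.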