User-centric Web crawling
Source International World Wide Web Conference archive
Proceedings of the 14th international conference on World Wide Web
Chiba, Japan
SESSION: User-focused search and crawling
Pages: 401 - 411
Year of Publication: 2005
ISBN: 1-59593-046-9
Authors
Sandeep Pandey, Carnegie Mellon University, Pittsburgh, PA
Christopher Olston, Carnegie Mellon University, Pittsburgh, PA
Sponsor
ACM: Association for Computing Machinery
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1060745.1060805

ABSTRACT

Search engines are the primary gateways of information access on the Web today. Behind the scenes, search engines crawl the Web to populate a local indexed repository of Web pages, used to answer user search queries. In an aggregate sense, the Web is very dynamic, causing any repository of Web pages to become out of date over time, which in turn causes query answer quality to degrade. Given the considerable size, dynamicity, and degree of autonomy of the Web as a whole, it is not feasible for a search engine to maintain its repository exactly synchronized with the Web.

In this paper we study how to schedule Web pages for selective (re)downloading into a search engine repository. The scheduling objective is to maximize the quality of the user experience for those who query the search engine. We begin with a quantitative characterization of the way in which the discrepancy between the content of the repository and the current content of the live Web impacts the quality of the user experience. This characterization leads to a user-centric metric of the quality of a search engine's local repository. We use this metric to derive a policy for scheduling Web page (re)downloading that is driven by search engine usage and free of exterior tuning parameters. We then focus on the important subproblem of scheduling the refreshing of Web pages already present in the repository, and show how to compute the priorities efficiently. We provide extensive empirical comparisons of our user-centric method against prior Web page refresh strategies, using real Web data. Our results demonstrate that our method requires far fewer resources to maintain the same search engine quality level for users, leaving substantially more resources available for incorporating new Web pages into the search repository.
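
The general idea of priority-driven refresh scheduling under a limited download budget can be illustrated with a minimal sketch. The abstract does not state the paper's priority formula, so the score below (exposure to users multiplied by estimated staleness) is only a placeholder assumption, and the names `Page`, `query_exposure`, `est_staleness`, `refresh_priority`, and `pick_pages_to_refresh` are hypothetical.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    query_exposure: float  # hypothetical: how often users see this page in results
    est_staleness: float   # hypothetical: estimated divergence from the live page, in [0, 1]


def refresh_priority(page: Page) -> float:
    # Placeholder score, NOT the paper's metric: pages that users see often
    # and that are likely out of date get refreshed first.
    return page.query_exposure * page.est_staleness


def pick_pages_to_refresh(pages: list[Page], budget: int) -> list[Page]:
    # Select the `budget` highest-priority pages, highest first.
    return heapq.nlargest(budget, pages, key=refresh_priority)


if __name__ == "__main__":
    repository = [
        Page("http://example.org/a", query_exposure=100.0, est_staleness=0.1),
        Page("http://example.org/b", query_exposure=10.0, est_staleness=0.9),
        Page("http://example.org/c", query_exposure=50.0, est_staleness=0.5),
    ]
    for page in pick_pages_to_refresh(repository, budget=2):
        print(page.url, refresh_priority(page))
```

A heap-based top-k selection keeps the choice cheap when the repository is much larger than the per-cycle budget; any score driven by search engine usage could be substituted for the placeholder.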


