ACM Home Page
Please provide us with feedback. Feedback
The discoverability of the web
Full text PdfPdf (326 KB)
Source
International World Wide Web Conference archive
Proceedings of the 16th international conference on World Wide Web table of contents
Banff, Alberta, Canada
SESSION: Crawlers table of contents
Pages: 421 - 430  
Year of Publication: 2007
ISBN:978-1-59593-654-7
Authors
Anirban Dasgupta  Yahoo! Research, Sunnyvale, CA
Arpita Ghosh  Yahoo! Research, Sunnyvale, CA
Ravi Kumar  Yahoo! Research, Sunnyvale, CA
Christopher Olston  Yahoo! Research, Sunnyvale, CA
Sandeep Pandey  Yahoo! Research, Sunnyvale, CA
Andrew Tomkins  Yahoo! Research, Sunnyvale, CA
Sponsor
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 10,   Downloads (12 Months): 85,   Citation Count: 5
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1242572.1242630
What is a DOI?

ABSTRACT

Previous studies have highlighted the high arrival rate of new contenton the web. We study the extent to which this new content can beefficiently discovered by a crawler. Our study has two parts. First,we study the inherent difficulty of the discovery problem using amaximum cover formulation, under an assumption of perfect estimates oflikely sources of links to new content. Second, we relax thisassumption and study a more realistic setting in which algorithms mustuse historical statistics to estimate which pages are most likely toyield links to new content. We recommend a simple algorithm thatperforms comparably to all approaches we consider.We measure the emphoverhead of discovering new content, defined asthe average number of fetches required to discover one new page. Weshow first that with perfect foreknowledge of where to explore forlinks to new content, it is possible to discover 90% of all newcontent with under 3% overhead, and 100% of new content with 9%overhead. But actual algorithms, which do not have access to perfectforeknowledge, face a more difficult task: one quarter of new contentis simply not amenable to efficient discovery. Of the remaining threequarters, 80% of new content during a given week may be discoveredwith 160% overhead if content is recrawled fully on a monthly basis.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
 
2
A.-L. Barabasi and R. Albert. Emergence of scaling in random networks. Science, 286:509--512, 1999.
 
3
 
4
 
5
6
 
7
 
8
 
9
J. ECoffman, Z. Liu, and R. R. Weber. Optimal robot scheduling for web search engines. Journal of Scheduling, 1(1):15--29, 1998.
10
11
12
 
13
 
14
 
15
J. Gittins. Bandit Processes and Dynamic Allocation Indices. John Wiley, 1989.
 
16
 
17
 
18
M. Mitzenmacher. A brief history of lognormal and power law distributions. Internet Mathematics, 1(2):226--251, 2004.
19
20
21
22
 
23
 
24
25


Collaborative Colleagues:
Anirban Dasgupta: colleagues
Arpita Ghosh: colleagues
Ravi Kumar: colleagues
Christopher Olston: colleagues
Sandeep Pandey: colleagues
Andrew Tomkins: colleagues