| The discoverability of the web |
| Full text |
Pdf
(326 KB)
|
Source
|
International World Wide Web Conference
archive
Proceedings of the 16th international conference on World Wide Web
table of contents
Banff, Alberta, Canada
SESSION: Crawlers
table of contents
Pages: 421 - 430
Year of Publication: 2007
ISBN:978-1-59593-654-7
|
|
Authors
|
|
Anirban Dasgupta
|
Yahoo! Research, Sunnyvale, CA
|
|
Arpita Ghosh
|
Yahoo! Research, Sunnyvale, CA
|
|
Ravi Kumar
|
Yahoo! Research, Sunnyvale, CA
|
|
Christopher Olston
|
Yahoo! Research, Sunnyvale, CA
|
|
Sandeep Pandey
|
Yahoo! Research, Sunnyvale, CA
|
|
Andrew Tomkins
|
Yahoo! Research, Sunnyvale, CA
|
|
| Sponsor |
|
| Publisher |
|
| Bibliometrics |
Downloads (6 Weeks): 8, Downloads (12 Months): 88, Citation Count: 5
|
|
|
ABSTRACT
Previous studies have highlighted the high arrival rate of new contenton the web. We study the extent to which this new content can beefficiently discovered by a crawler. Our study has two parts. First,we study the inherent difficulty of the discovery problem using amaximum cover formulation, under an assumption of perfect estimates oflikely sources of links to new content. Second, we relax thisassumption and study a more realistic setting in which algorithms mustuse historical statistics to estimate which pages are most likely toyield links to new content. We recommend a simple algorithm thatperforms comparably to all approaches we consider.We measure the emphoverhead of discovering new content, defined asthe average number of fetches required to discover one new page. Weshow first that with perfect foreknowledge of where to explore forlinks to new content, it is possible to discover 90% of all newcontent with under 3% overhead, and 100% of new content with 9%overhead. But actual algorithms, which do not have access to perfectforeknowledge, face a more difficult task: one quarter of new contentis simply not amenable to efficient discovery. Of the remaining threequarters, 80% of new content during a given week may be discoveredwith 160% overhead if content is recrawled fully on a monthly basis.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
A.-L. Barabasi and R. Albert. Emergence of scaling in random networks. Science, 286:509--512, 1999.
|
| |
3
|
|
| |
4
|
Andrei Broder , Ravi Kumar , Farzin Maghoul , Prabhakar Raghavan , Sridhar Rajagopalan , Raymie Stata , Andrew Tomkins , Janet Wiener, Graph structure in the Web, Computer Networks: The International Journal of Computer and Telecommunications Networking, v.33 n.1-6, p.309-320, June 2000
|
| |
5
|
|
 |
6
|
|
| |
7
|
|
| |
8
|
Fred Douglis , Anja Feldmann , Balachander Krishnamurthy , Jeffrey Mogul, Rate of change and other metrics: a live study of the world wide web, Proceedings of the USENIX Symposium on Internet Technologies and Systems on USENIX Symposium on Internet Technologies and Systems, p.14-14, December 08-11, 1997, Monterey, California
|
| |
9
|
J. ECoffman, Z. Liu, and R. R. Weber. Optimal robot scheduling for web search engines. Journal of Scheduling, 1(1):15--29, 1998.
|
 |
10
|
|
 |
11
|
|
 |
12
|
Michalis Faloutsos , Petros Faloutsos , Christos Faloutsos, On power-law relationships of the Internet topology, Proceedings of the conference on Applications, technologies, architectures, and protocols for computer communication, p.251-262, August 30-September 03, 1999, Cambridge, Massachusetts, United States
|
| |
13
|
|
| |
14
|
|
| |
15
|
J. Gittins. Bandit Processes and Dynamic Allocation Indices. John Wiley, 1989.
|
| |
16
|
|
| |
17
|
|
| |
18
|
M. Mitzenmacher. A brief history of lognormal and power law distributions. Internet Mathematics, 1(2):226--251, 2004.
|
 |
19
|
|
 |
20
|
|
 |
21
|
|
 |
22
|
James Pitkow , Peter Pirolli, Life, death, and lawfulness on the electronic frontier, Proceedings of the SIGCHI conference on Human factors in computing systems, p.383-390, March 22-27, 1997, Atlanta, Georgia, United States
[doi> 10.1145/258549.258805]
|
| |
23
|
|
| |
24
|
|
 |
25
|
J. L. Wolf , M. S. Squillante , P. S. Yu , J. Sethuraman , L. Ozsen, Optimal crawling strategies for web search engines, Proceedings of the 11th international conference on World Wide Web, May 07-11, 2002, Honolulu, Hawaii, USA
[doi> 10.1145/511446.511465]
|
|