Recrawl scheduling based on information longevity
Source
International World Wide Web Conference archive
Proceeding of the 17th international conference on World Wide Web
Beijing, China
SESSION: Search: crawlers
Pages 437-446
Year of Publication: 2008
ISBN: 978-1-60558-085-2
Bibliometrics
Downloads (6 Weeks): 18, Downloads (12 Months): 147, Citation Count: 6
ABSTRACT
It is crucial for a web crawler to distinguish between ephemeral and persistent content. Ephemeral content (e.g., quote of the day) is usually not worth crawling, because by the time it reaches the index it is no longer representative of the web page from which it was acquired. On the other hand, content that persists across multiple page updates (e.g., recent blog postings) may be worth acquiring, because it matches the page's true content for a sustained period of time. In this paper we characterize the longevity of information found on the web, via both empirical measurements and a generative model that coincides with these measurements. We then develop new recrawl scheduling policies that take longevity into account. As we show via experiments over real web data, our policies obtain better freshness at lower cost, compared with previous approaches.
CITED BY 6
Marc Spaniol, Dimitar Denev, Arturas Mazeika, Gerhard Weikum, Pierre Senellart, Data quality in web archiving, Proceedings of the 3rd workshop on Information credibility on the web, April 20, 2009, Madrid, Spain
Jiang-Ming Yang, Rui Cai, Chunsong Wang, Hua Huang, Lei Zhang, Wei-Ying Ma, Incorporating site-level knowledge for incremental crawling of web forums: a list-wise strategy, Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, June 28-July 01, 2009, Paris, France