Lazy preservation: reconstructing websites by crawling the crawlers
Source: Workshop On Web Information And Data Management
Proceedings of the 8th annual ACM international workshop on Web information and data management
Arlington, Virginia, USA
SESSION: Web resource crawling and searching
Pages: 67 - 74
Year of Publication: 2006
ISBN: 1-59593-525-8
Bibliometrics: Downloads (6 Weeks): 6, Downloads (12 Months): 55, Citation Count: 7
ABSTRACT
Backup of websites is often not considered until after a catastrophic event has occurred to either the website or its webmaster. We introduce "lazy preservation" -- digital preservation performed as a result of the normal operation of web crawlers and caches. Lazy preservation is especially suitable for third parties; for example, a teacher reconstructing a missing website used in previous classes. We evaluate the effectiveness of lazy preservation by reconstructing 24 websites of varying sizes and composition using Warrick, a web-repository crawler. Because of varying levels of completeness in any one repository, our reconstructions sampled from four different web repositories: Google (44%), MSN (30%), Internet Archive (19%) and Yahoo (7%). We also measured the time required for web resources to be discovered and cached (10-103 days) as well as how long they remained in cache after deletion (7-61 days).