Lazy preservation: reconstructing websites by crawling the crawlers
Source: Workshop On Web Information And Data Management
Proceedings of the 8th annual ACM international workshop on Web information and data management
Arlington, Virginia, USA
SESSION: Web resource crawling and searching
Pages: 67-74
Year of Publication: 2006
ISBN: 1-59593-525-8
Authors
Frank McCown, Old Dominion University, Norfolk, Virginia
Joan A. Smith, Old Dominion University, Norfolk, Virginia
Michael L. Nelson, Old Dominion University, Norfolk, Virginia
Sponsors
SIGIR: ACM Special Interest Group on Information Retrieval
ACM: Association for Computing Machinery
Publisher
ACM, New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 6,   Downloads (12 Months): 55,   Citation Count: 7
DOI: http://doi.acm.org/10.1145/1183550.1183564

ABSTRACT

Backup of websites is often not considered until after a catastrophic event has occurred to either the website or its webmaster. We introduce "lazy preservation" -- digital preservation performed as a result of the normal operation of web crawlers and caches. Lazy preservation is especially suitable for third parties; for example, a teacher reconstructing a missing website used in previous classes. We evaluate the effectiveness of lazy preservation by reconstructing 24 websites of varying sizes and composition using Warrick, a web-repository crawler. Because of varying levels of completeness in any one repository, our reconstructions sampled from four different web repositories: Google (44%), MSN (30%), Internet Archive (19%) and Yahoo (7%). We also measured the time required for web resources to be discovered and cached (10-103 days) as well as how long they remained in cache after deletion (7-61 days).



