Why web sites are lost (and how they're sometimes found)
Source
Communications of the ACM
Volume 52, Issue 11 (November 2009)
Scratch Programming for All
SECTION: Virtual extension
Pages 141-145
Year of Publication: 2009
ISSN:0001-0782
Authors
Frank McCown  Harding University, Searcy, AR
Catherine C. Marshall  Microsoft Research, Silicon Valley
Michael L. Nelson  Old Dominion University
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1592761.1592794

ABSTRACT

Introduction

The Web is in constant flux: new pages and Web sites appear daily, and old pages and sites disappear almost as quickly. One study estimates that about two percent of the Web disappears from its current location every week [2]. Although Web users have become accustomed to seeing the infamous "404 Not Found" page, they are more taken aback when they own, are responsible for, or have come to rely on the missing material.

Web archivists like those at the Internet Archive have responded to the Web's transience by archiving as much of it as possible, hoping to preserve snapshots of the Web for future generations [3]. Search engines have also responded by offering pages that have been cached as a result of the indexing process. These straightforward archiving and caching efforts have been used by the public in unintended ways: individuals and organizations have used them to restore their own lost Web sites [5].

To automate the recovery of lost Web sites, we created a Web-repository crawler named Warrick that restores lost resources from the holdings of four Web repositories: Internet Archive, Google, Live Search (now Bing), and Yahoo [6]; we refer to these Web repositories collectively as the Web Infrastructure (WI). We call this after-loss recovery Lazy Preservation (see the sidebar for more information). Warrick can only recover what is accessible to the WI, namely the crawlable Web. Numerous resources cannot be found in the WI: password-protected content, pages without incoming links or protected by the robots exclusion protocol, and content hidden behind Flash or JavaScript interfaces. Most importantly, WI crawlers do not have access to the server-side components of a Web site (for example, scripts, configuration files, and databases).
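The idea of recovering a site from several repositories can be illustrated with a minimal sketch. This is not Warrick's implementation: the names StoredCopy, Repository, and recover are hypothetical, the repositories are in-memory stand-ins rather than real services, and choosing the most recently captured copy is an assumption made only for this example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional


@dataclass(frozen=True)
class StoredCopy:
    """One copy of a resource held by a repository (hypothetical record type)."""
    repository: str
    url: str
    captured: date
    body: str


class Repository:
    """In-memory stand-in for a Web repository; a real one would be queried over HTTP."""

    def __init__(self, name: str, holdings: Dict[str, StoredCopy]) -> None:
        self.name = name
        self._holdings = holdings

    def lookup(self, url: str) -> Optional[StoredCopy]:
        # Return the stored copy for this URL, or None if the repository has none.
        return self._holdings.get(url)


def recover(urls: List[str],
            repositories: List[Repository]) -> Dict[str, Optional[StoredCopy]]:
    """Ask every repository for each URL and keep the newest copy found.

    Keeping the newest copy is an assumption of this sketch, not a documented policy.
    """
    recovered: Dict[str, Optional[StoredCopy]] = {}
    for url in urls:
        copies = [c for c in (r.lookup(url) for r in repositories) if c is not None]
        # max(..., default=None) yields None when no repository holds the URL.
        recovered[url] = max(copies, key=lambda c: c.captured, default=None)
    return recovered


home = "http://example.org/"
script = "http://example.org/cgi-bin/search"  # server-side resource, never crawled

archive = Repository("archive", {
    home: StoredCopy("archive", home, date(2005, 1, 10), "<html>older</html>"),
})
cache = Repository("cache", {
    home: StoredCopy("cache", home, date(2005, 6, 1), "<html>newer</html>"),
})

result = recover([home, script], [archive, cache])
for url, copy in result.items():
    print(url, "->", copy.repository if copy else "not recoverable")
```

In this toy run the home page is held by both stand-in repositories, while the server-side script is held by neither, which mirrors the limitation described above: anything the crawlers never saw cannot be restored.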

Nevertheless, upon Warrick's public release in 2005, we received many inquiries about its usage and collected a handful of anecdotes about the Web sites individuals and organizations had lost and wanted to recover. Were these Web sites representative? What types of Web resources were people losing? Given the inherent limitations of the WI, were Warrick users recovering enough material to reconstruct the site? Were these losses changing their behavior, or was the availability of cached material reinforcing a "lazy" approach to preservation?

We constructed an online survey to explore these questions and conducted a set of in-depth interviews with survey respondents to clarify the results. Potential participants were solicited by us or the Internet Archive, or they found a link to the survey on the Warrick Web site. A total of 52 participants completed the survey regarding 55 lost Web sites, and seven of the participants allowed us to follow up with telephone or instant messaging interviews. Participants were divided into two groups:

1. Personal loss: Those who had lost (and tried to recover) a Web site that they had personally created, maintained, or owned (34 participants who lost 37 Web sites).

2. Third party: Those who had recovered someone else's lost Web site (18 participants who recovered 18 Web sites).


REFERENCES


 
1
Cox, L. P., Murray, C. D., and Noble, B. D. Pastiche: Making backup cheap and easy. SIGOPS Operating Systems Review 36, SI, (2002), 285--298.
 
2
Fetterly, D., Manasse, M., Najork, M., and Wiener, J. A large-scale study of the evolution of Web pages. In Proceedings of WWW '03, (2003), 669--678.
 
3
Kahle, B. Preserving the Internet. Scientific American, (Mar. 1997), 82--83.
 
4
Marshall, C., Bly, S., and Brun-Cottan, F. The long term fate of our personal digital belongings: Toward a service model for personal archives. In Proceedings of IS&T Archiving 2006, (2006), 25--30.
 
5
Marshall, C., McCown, F., and Nelson, M. L. Evaluating personal archiving strategies for Internet-based information. In Proceedings of IS&T Archiving 2007, (2007), 151--156.
 
6
McCown, F., Smith, J. A., Nelson, M. L., and Bollen, J. Lazy preservation: Reconstructing Websites by crawling the crawlers. In Proceedings of ACM WIDM '06, (2006), 67--74.
 
7
McCown, F., Benjelloun, A., and Nelson, M. L. Brass: A queueing manager for Warrick. In Proceedings of the 7th International Web Archiving Workshop (IWAW '07), (June 2007).
 
8
McCown, F., Diawara, N., and Nelson, M. L. Factors affecting website reconstruction from the web infrastructure. In Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '07), (June 2007), 39--48.
 
9
McCown, F., Smith, J. A., Nelson, M. L., and Bollen, J. Lazy preservation: Reconstructing websites by crawling the crawlers. In Proceedings of the 8th ACM International Workshop on Web Information and Data Management (WIDM '06), (2006), 67--74.
 
10
Nelson, M. L., McCown, F., Smith, J. A., and Klein, M. Using the web infrastructure to preserve web pages. International Journal on Digital Libraries 6, 4, (2007), 327--349.
 
11
Rabin, M. O. Efficient dispersal of information for security, load balancing, and fault tolerance. Journal of the ACM 36, 2, (1989), 335--348.