ABSTRACT
When a website is suddenly lost without a backup, it may be reconstituted by probing web archives and search engine caches for missing content. In this paper we describe an experiment in which we crawled and reconstructed 300 randomly selected websites on a weekly basis for 14 weeks. The reconstructions were performed using our web-repository crawler, Warrick, which recovers missing resources from the Web Infrastructure (WI), the collective preservation effort of web archives and search engine caches. We examine several characteristics of the websites over time, including birth rate, decay, and age of resources. We evaluate the reconstructions against the crawled sites and develop a statistical model for predicting reconstruction success from the WI. On average, we were able to recover 61% of each website's resources. We found that Google's PageRank, number of hops, and resource age were the three most significant factors in determining whether a resource would be recovered from the WI.