Factors affecting website reconstruction from the web infrastructure
Source
International Conference on Digital Libraries
Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries
Vancouver, BC, Canada
SESSION: Digital curation and preservation
Pages: 39 - 48  
Year of Publication: 2007
ISBN:978-1-59593-644-8
Authors
Frank McCown  Old Dominion University, Norfolk, VA
Norou Diawara  Old Dominion University, Norfolk, VA
Michael L. Nelson  Old Dominion University, Norfolk, VA
Sponsors
ACM: Association for Computing Machinery
SIGIR: ACM Special Interest Group on Information Retrieval
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1255175.1255182

ABSTRACT

When a website is suddenly lost without a backup, it may be reconstituted by probing web archives and search engine caches for the missing content. In this paper we describe an experiment in which we crawled and reconstructed 300 randomly selected websites once a week for 14 weeks. The reconstructions were performed with Warrick, our web-repository crawler, which recovers missing resources from the Web Infrastructure (WI), the collective preservation effort of web archives and search engine caches. We examine several characteristics of the websites over time, including birth rate, decay, and age of resources. We evaluate the reconstructions against the crawled sites and develop a statistical model for predicting reconstruction success from the WI. On average, we were able to recover 61% of each website's resources. We found that Google's PageRank, number of hops, and resource age were the three most significant factors in determining whether a resource would be recovered from the WI.
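
The "on average ... of each website's resources" figure is a per-site average: a recovery rate is computed for each website, and those rates are then averaged. The following is a minimal sketch of that arithmetic only. It assumes resources are matched by URL and uses made-up example data; the names `recovery_rate` and `average_recovery` are hypothetical, and this is not necessarily how the paper's evaluation was implemented.

```python
from statistics import mean


def recovery_rate(crawled_urls, recovered_urls):
    """Fraction of a site's crawled resources that also appear in its reconstruction."""
    crawled = set(crawled_urls)
    if not crawled:
        return 0.0  # assumption: an empty site counts as 0% recovered
    return len(crawled & set(recovered_urls)) / len(crawled)


def average_recovery(sites):
    """Mean of the per-site recovery rates; `sites` holds (crawled, recovered) pairs."""
    return mean(recovery_rate(crawled, recovered) for crawled, recovered in sites)


# Illustrative data only, not taken from the paper.
sites = [
    (["/", "/a.html", "/b.png", "/c.pdf"], ["/", "/a.html", "/b.png"]),  # 3 of 4 recovered
    (["/", "/x.html"], ["/"]),                                            # 1 of 2 recovered
]
print(average_recovery(sites))
```

Averaging per-site rates weights every website equally regardless of size, which differs from dividing the total number of recovered resources by the total number crawled.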


