ABSTRACT
Our previous research has shown that the collective behavior of search engine caches (e.g., Google, Yahoo, Live Search) and web archives (e.g., Internet Archive) results in the uncoordinated but large-scale refreshing and migrating of web resources. Interacting with these caches and archives, which we call the Web Infrastructure (WI), allows entire websites to be reconstructed in an approach we call lazy preservation. Unfortunately, the WI captures only the client-side view of a web resource. While this view may be useful for recovering much of a website's content, it does not help restore the scripts, web server configuration, databases, and other server-side components that generate the website's resources. This paper proposes a novel technique for storing the server-side components of a website in the WI and recovering them from it. By using erasure codes to embed the server-side components as HTML comments throughout the website, we can reconstruct all of the server components even when only a portion of the client-side resources has been extracted from the WI. We present the results of a preliminary study that establishes a baseline for the lazy preservation of ten EPrints repositories and then examines the preservation of an EPrints repository that uses the erasure code technique to store the server-side EPrints software throughout the website. We found that nearly 100% of the EPrints components were recoverable from the WI just two weeks after the repository came online, and that the repository remained recoverable four months after it was "lost".
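As a minimal illustration of the embedding idea, and not the coding scheme used in this study, the sketch below splits a server-side file into k blocks, adds a single XOR parity block, and wraps each block in an HTML comment that can be appended to a different page. The comment layout (`BLOCK i/k len=n`), the function names, and the single-parity code are all assumptions made for illustration; a single-parity code tolerates the loss of only one page, whereas a practical deployment would use a stronger erasure code.

```python
import base64
import re
from functools import reduce

# Hypothetical comment layout, chosen only for this sketch.
PATTERN = re.compile(r"<!-- BLOCK (\d+)/(\d+) len=(\d+) ([A-Za-z0-9+/=]*) -->")


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(payload: bytes, k: int) -> list[str]:
    """Return k data blocks plus one XOR parity block, each as an HTML comment."""
    size = -(-len(payload) // k)  # ceiling division
    padded = payload.ljust(size * k, b"\0")
    blocks = [padded[i * size:(i + 1) * size] for i in range(k)]
    blocks.append(reduce(xor_bytes, blocks))  # parity block gets index k
    return [
        f"<!-- BLOCK {i}/{k} len={len(payload)} {base64.b64encode(b).decode()} -->"
        for i, b in enumerate(blocks)
    ]


def decode(pages: list[str]) -> bytes:
    """Rebuild the payload from whichever pages were recovered."""
    found = {}
    k = length = None
    for page in pages:
        for idx, total, n, data in PATTERN.findall(page):
            found[int(idx)] = base64.b64decode(data)
            k, length = int(total), int(n)
    if k is None or len(found) < k:
        raise ValueError("too few blocks recovered")
    missing = [i for i in range(k) if i not in found]
    if missing:
        # Any k of the k+1 blocks suffice: XOR-ing the k recovered blocks
        # (k-1 data blocks plus parity) yields the one missing data block.
        found[missing[0]] = reduce(xor_bytes, found.values())
    return b"".join(found[i] for i in range(k))[:length]


# One comment per page; one page is then "lost" before recovery.
source = b"print 'server-side script';\n" * 3
comments = encode(source, k=4)
pages = [f"<html><body>page {i}</body>{c}</html>" for i, c in enumerate(comments)]
del pages[2]
recovered = decode(pages)
assert recovered == source
```

Base64 output never contains `--`, so each encoded block stays inside a well-formed HTML comment and survives in whatever client-side copy a cache or archive happens to store.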