ABSTRACT
The World Wide Web has become a key source of knowledge pertaining to almost every walk of life. Unfortunately, much of the data on the Web is highly ephemeral in nature, with an estimated 50-80% of content changing within a short time. Continuing the pioneering efforts of many national (digital) libraries, organizations such as the International Internet Preservation Consortium (IIPC), the Internet Archive (IA) and the European Archive (EA) have been working tirelessly towards preserving the ever-changing Web. However, while these web archiving efforts have paid significant attention to the long-term preservation of Web data, they have paid little attention to developing a global-scale infrastructure for collecting, archiving, and performing historical analyses on the collected data. Based on insights from our recent work on building text analytics for Web archives, we propose EverLast, a scalable distributed framework for next-generation Web archival and temporal text analytics over the archive. Our system is built on a loosely coupled distributed architecture that can be deployed over large-scale peer-to-peer networks. This allows the integration of many archival efforts, most of which are undertaken at a national level by national digital libraries. Key features of EverLast include support for time-based text search and analysis and the use of human-assisted archive gathering. In this paper, we outline the overall architecture of EverLast and present some promising preliminary results.