EverLast: a distributed architecture for preserving the web
Source
International Conference on Digital Libraries
Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries
Austin, TX, USA
SESSION: 12
Pages 331-340  
Year of Publication: 2009
ISBN: 978-1-60558-322-8
Authors
Avishek Anand  Max-Planck Institute for Informatics, Saarbrücken, Germany
Srikanta Bedathur  Max-Planck Institute for Informatics, Saarbrücken, Germany
Klaus Berberich  Max-Planck Institute for Informatics, Saarbrücken, Germany
Ralf Schenkel  Saarland University, Saarbrücken, Germany
Christos Tryfonopoulos  Max-Planck Institute for Informatics, Saarbrücken, Germany
Sponsors
SIGIR: ACM Special Interest Group on Information Retrieval
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1555400.1555455

ABSTRACT

The World Wide Web has become a key source of knowledge pertaining to almost every walk of life. Unfortunately, much of the data on the Web is highly ephemeral, with an estimated 50-80% of content changing within a short time. Continuing the pioneering efforts of many national (digital) libraries, organizations such as the International Internet Preservation Consortium (IIPC), the Internet Archive (IA) and the European Archive (EA) have been working tirelessly towards preserving the ever-changing Web.

However, while these web archiving efforts have paid significant attention to the long-term preservation of Web data, they have paid little attention to developing a global-scale infrastructure for collecting, archiving, and performing historical analyses on the collected data. Based on insights from our recent work on building text analytics for Web archives, we propose EverLast, a scalable distributed framework for next-generation Web archival and temporal text analytics over the archive. Our system is built on a loosely coupled distributed architecture that can be deployed over large-scale peer-to-peer networks. In this way, we allow the integration of many archival efforts undertaken mainly at a national level by national digital libraries. Key features of EverLast include support for time-based text search and analysis and the use of human-assisted archive gathering. In this paper, we outline the overall architecture of EverLast and present some promising preliminary results.


