Managing duplicates in a web archive
Source: Symposium on Applied Computing
Proceedings of the 2006 ACM Symposium on Applied Computing
Dijon, France
SESSION: Document engineering (DE)
Pages: 818 - 825  
Year of Publication: 2006
ISBN:1-59593-108-2
Authors
Daniel Gomes  Universidade de Lisboa, Lisboa, Portugal
André L. Santos  Universidade de Lisboa, Lisboa, Portugal
Mário J. Silva  Universidade de Lisboa, Lisboa, Portugal
Sponsor
SIGAPP: ACM Special Interest Group on Applied Computing
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1141277.1141465

ABSTRACT

Crawlers harvest the web by iteratively downloading documents referenced by URLs. Different URLs frequently refer to the same document, which leads crawlers to download duplicates. As a result, web archives built through incremental crawls waste space storing these documents. In this paper, we study the existence of duplicates within a web archive and discuss strategies to eliminate them at the storage level during the crawl. We present a storage system architecture that addresses the requirements of web archives and detail its implementation and evaluation. The system now supports an archive of the Portuguese web, replacing previous NFS-based storage servers. Experimental results showed that eliminating duplicates can improve storage throughput. The web storage system outperformed NFS-based storage by 68% in read operations and by 50% in write operations.
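
The abstract does not say how duplicates are detected or removed, so the following is only a minimal sketch of one common way to eliminate duplicates at the storage level. It assumes that a duplicate means byte-identical content and that content is identified by its SHA-256 digest. The names DedupStore, put, and get are hypothetical and are not taken from the paper.

```python
import hashlib


class DedupStore:
    """Toy in-memory store that keeps one copy of each distinct content.

    Assumption: two documents are duplicates when their bytes are identical.
    This is an illustration, not the system described in the paper.
    """

    def __init__(self):
        self.blobs = {}  # digest -> content bytes (stored once)
        self.urls = {}   # URL -> digest of the content it refers to

    def put(self, url: str, content: bytes) -> bool:
        """Record content for url; return True if the bytes were new."""
        digest = hashlib.sha256(content).hexdigest()
        is_new = digest not in self.blobs
        if is_new:
            self.blobs[digest] = content
        # The URL always points at the digest, whether or not it was new.
        self.urls[url] = digest
        return is_new

    def get(self, url: str) -> bytes:
        """Return the content last stored for url."""
        return self.blobs[self.urls[url]]


store = DedupStore()
first = store.put("http://example.org/", b"<html>home</html>")
second = store.put("http://example.org/index.html", b"<html>home</html>")
third = store.put("http://example.org/about", b"<html>about</html>")
```

Here the first two URLs refer to the same bytes, so the store keeps three URL entries but only two stored contents. Skipping the write for content that is already present is one plausible reason duplicate elimination could improve write throughput, although the abstract itself does not give the mechanism.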



