ABSTRACT
Crawlers harvest the web by iteratively downloading documents referenced by URLs. Different URLs often refer to the same document, which leads crawlers to download duplicates. As a result, web archives built through incremental crawls waste space storing these duplicates. In this paper, we study the occurrence of duplicates within a web archive and discuss strategies to eliminate them at the storage level during the crawl. We present a storage system architecture that addresses the requirements of web archives, and we detail its implementation and evaluation. The system now supports an archive of the Portuguese web, replacing the previous NFS-based storage servers. Experimental results show that eliminating duplicates can improve storage throughput. The web storage system outperformed NFS-based storage by 68% in read operations and by 50% in write operations.
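To make the idea of eliminating duplicates at the storage level concrete, the following is a minimal sketch and not the system described in this paper. It assumes that duplicates are detected by comparing a SHA-256 digest of the downloaded bytes; the class `DedupStore`, its methods, and the in-memory dictionaries are hypothetical names chosen for illustration only.

```python
import hashlib


class DedupStore:
    """Toy in-memory store that keeps one copy of each distinct content.

    Illustrative assumption: two documents are duplicates when their
    SHA-256 digests match. The real system may use a different scheme.
    """

    def __init__(self):
        self.blobs = {}      # content digest -> stored bytes
        self.url_index = {}  # URL -> content digest

    def put(self, url: str, content: bytes) -> bool:
        """Record content for url; return True only if new bytes were stored."""
        digest = hashlib.sha256(content).hexdigest()
        is_new = digest not in self.blobs
        if is_new:
            # First time this content is seen: store the bytes once.
            self.blobs[digest] = content
        # Every URL is indexed, even when its content is a duplicate.
        self.url_index[url] = digest
        return is_new

    def get(self, url: str) -> bytes:
        """Return the content previously stored for url."""
        return self.blobs[self.url_index[url]]


store = DedupStore()
store.put("http://example.com/", b"<html>hi</html>")
store.put("http://example.com/index.html", b"<html>hi</html>")
```

In this sketch the second `put` adds only an index entry, so two URLs that refer to the same document share a single stored copy.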