Architecture of the Internet Archive
Source: ACM International Conference Proceeding Series
Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference
Haifa, Israel
SESSION: Storage
Article No. 11  
Year of Publication: 2009
ISBN: 978-1-60558-623-6
Authors
Elliot Jaffe  The Hebrew University of Jerusalem, Jerusalem, Israel
Scott Kirkpatrick  The Hebrew University of Jerusalem, Jerusalem, Israel
Sponsors
Mellanox Technologies
Hebrew University of Jerusalem
IBM
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1534530.1534545

ABSTRACT

The Internet Archive is a live production system supporting close to a petabyte of data and delivering an average of 2.3 Gb/sec of data to Internet users. We describe the architecture of this system with an emphasis on its robustness and how it is managed by a very small team of systems personnel. Notably, the current system does not employ a cache. We analyze the reasons for this decision and show that an effective cache could not be built until now. However, new solid-state disk technology may offer promising new cache implementations.


