ABSTRACT
The Internet Archive is a live production system supporting close to a petabyte of data and delivering an average of 2.3 Gb/sec of data to Internet users. We describe the architecture of this system with an emphasis on its robustness and how it is managed by a very small team of systems personnel. Notably, the current system does not employ a cache. We analyze the reasons for this decision and show that an effective cache could not be built until now. However, new solid-state disk technology may offer promising new cache implementations.