Are disks the dominant contributor for storage failures?: A comprehensive study of storage subsystem failure characteristics
Source
ACM Transactions on Storage (TOS)
Volume 4, Issue 3 (November 2008)
Article No. 7
Year of Publication: 2008
ISSN: 1553-3077
Authors
Weihang Jiang, University of Illinois at Urbana-Champaign, Urbana, IL
Chongfeng Hu, University of Illinois at Urbana-Champaign, Urbana, IL
Yuanyuan Zhou, University of Illinois at Urbana-Champaign, Urbana, IL
Arkady Kanevsky, Network Appliance, Inc., Sunnyvale, CA
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1416944.1416946

ABSTRACT

Building reliable storage systems becomes increasingly challenging as the complexity of modern storage systems continues to grow. Understanding storage failure characteristics is crucial for designing and building a reliable storage system. Although several recent studies have examined storage failures, almost all of them focus on the failure characteristics of a single component, the disk, and do not study failures of other storage components.

This article analyzes the failure characteristics of storage subsystems. Specifically, we analyze storage logs collected from about 39,000 storage systems commercially deployed at various customer sites. The dataset covers a period of 44 months and includes about 1,800,000 disks hosted in about 155,000 storage-shelf enclosures. Our study reveals many interesting findings that provide useful guidelines for designing reliable storage systems. Our major findings include: (1) Disk failures contribute 20-55% of storage subsystem failures, but other components, such as physical interconnects and protocol stacks, also account for a significant percentage of storage subsystem failures. (2) Each individual storage subsystem failure type, and storage subsystem failure as a whole, exhibits strong self-correlation. In addition, these failures exhibit “bursty” patterns. (3) Storage subsystems configured with redundant interconnects experience 30-40% lower failure rates than those with a single interconnect. (4) Spanning the disks of a RAID group across multiple shelves provides a more resilient solution for storage subsystems than placing them within a single shelf.


