An analysis of data corruption in the storage stack
Source
ACM Transactions on Storage (TOS)
Volume 4, Issue 3 (November 2008)
Article No. 8
Year of Publication: 2008
ISSN: 1553-3077
Authors
Lakshmi N. Bairavasundaram, University of Wisconsin-Madison, Madison, WI
Andrea C. Arpaci-Dusseau, University of Wisconsin-Madison, Madison, WI
Remzi H. Arpaci-Dusseau, University of Wisconsin-Madison, Madison, WI
Garth R. Goodson, NetApp, Sunnyvale, CA
Bianca Schroeder, University of Toronto, Toronto, ON
ABSTRACT
An important threat to reliable storage of data is silent data corruption. In order to develop suitable protection mechanisms against data corruption, it is essential to understand its characteristics. In this article, we present the first large-scale study of data corruption. We analyze corruption instances recorded in production storage systems containing a total of 1.53 million disk drives, over a period of 41 months. We study three classes of corruption: checksum mismatches, identity discrepancies, and parity inconsistencies. We focus on checksum mismatches since they occur the most. We find more than 400,000 instances of checksum mismatches over the 41-month period. We find many interesting trends among these instances, including: (i) nearline disks (and their adapters) develop checksum mismatches an order of magnitude more often than enterprise-class disk drives, (ii) checksum mismatches within the same disk are not independent events and they show high spatial and temporal locality, and (iii) checksum mismatches across different disks in the same storage system are not independent. We use our observations to derive lessons for corruption-proof system design.