An analysis of data corruption in the storage stack
Source
ACM Transactions on Storage (TOS)
Volume 4, Issue 3 (November 2008)
Article No. 8
Year of Publication: 2008
ISSN:1553-3077
Authors
Lakshmi N. Bairavasundaram, University of Wisconsin-Madison, Madison, WI
Andrea C. Arpaci-Dusseau, University of Wisconsin-Madison, Madison, WI
Remzi H. Arpaci-Dusseau, University of Wisconsin-Madison, Madison, WI
Garth R. Goodson, NetApp, Sunnyvale, CA
Bianca Schroeder, University of Toronto, Toronto, ON
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1416944.1416947

ABSTRACT

An important threat to reliable storage of data is silent data corruption. In order to develop suitable protection mechanisms against data corruption, it is essential to understand its characteristics. In this article, we present the first large-scale study of data corruption. We analyze corruption instances recorded in production storage systems containing a total of 1.53 million disk drives, over a period of 41 months. We study three classes of corruption: checksum mismatches, identity discrepancies, and parity inconsistencies. We focus on checksum mismatches since they occur the most.

We find more than 400,000 instances of checksum mismatches over the 41-month period. We find many interesting trends among these instances, including: (i) nearline disks (and their adapters) develop checksum mismatches an order of magnitude more often than enterprise-class disk drives, (ii) checksum mismatches within the same disk are not independent events and they show high spatial and temporal locality, and (iii) checksum mismatches across different disks in the same storage system are not independent. We use our observations to derive lessons for corruption-proof system design.


