DRAM errors in the wild: a large-scale field study
Joint International Conference on Measurement and Modeling of Computer Systems
Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems
Seattle, WA, USA
SESSION: Memory and storage
Pages 193-204
Year of Publication: 2009
ISBN: 978-1-60558-511-6
ABSTRACT
Errors in dynamic random access memory (DRAM) are a common form of hardware failure in modern compute clusters. Failures are costly in terms of both hardware replacement and service disruption. While a large body of work exists on DRAM in laboratory conditions, little has been reported on real DRAM failures in large production clusters. In this paper, we analyze measurements of memory errors in a large fleet of commodity servers over a period of 2.5 years. The collected data covers multiple vendors, DRAM capacities and technologies, and comprises many millions of DIMM days. The goal of this paper is to answer questions such as the following: How common are memory errors in practice? What are their statistical properties? How are they affected by external factors, such as temperature and utilization, and by chip-specific factors, such as chip density, memory technology and DIMM age? We find that DRAM error behavior in the field differs in many key aspects from commonly held assumptions. For example, we observe DRAM error rates that are orders of magnitude higher than previously reported, with 25,000 to 70,000 errors per billion device hours per Mbit and more than 8% of DIMMs affected by errors per year. We provide strong evidence that memory errors are dominated by hard errors rather than soft errors, which previous work suspects to be the dominant error mode. We find that temperature, known to strongly impact DIMM error rates in lab conditions, has a surprisingly small effect on error behavior in the field once all other factors are taken into account. Finally, contrary to common fears, we observe no indication that newer generations of DIMMs have worse error behavior.
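To give a feel for the unit "errors per billion device hours per Mbit", the following minimal sketch converts such a rate into a mean number of errors per year for a single DIMM. It is an illustration only, not the paper's method: the function name `errors_per_year`, the 1 GiB example capacity, the convention 1 Mbit = 2^20 bits, the 8,760-hour year, and the reading of "device" as one DIMM are all assumptions made here.

```python
# Illustrative unit conversion only; names and constants below are assumptions,
# not values or methods taken from the paper.

HOURS_PER_YEAR = 24 * 365   # 8,760 hours, ignoring leap years (assumption)
MBIT_PER_GIB = 8 * 1024     # assumes 1 Mbit = 2**20 bits, so 1 GiB = 8,192 Mbit


def errors_per_year(rate_per_mbit: float, capacity_gib: float) -> float:
    """Mean errors per year for one device.

    rate_per_mbit: errors per 10**9 device hours per Mbit.
    capacity_gib:  device capacity in GiB (hypothetical example value).
    """
    capacity_mbit = capacity_gib * MBIT_PER_GIB
    errors_per_hour = rate_per_mbit * capacity_mbit / 1e9
    return errors_per_hour * HOURS_PER_YEAR


if __name__ == "__main__":
    # The two ends of the range quoted in the abstract, for a hypothetical 1 GiB DIMM.
    for rate in (25_000, 70_000):
        print(f"{rate} per 1e9 device hours per Mbit: "
              f"{errors_per_year(rate, 1):.0f} errors/year on average")
```

Under these assumptions the quoted range corresponds to a mean of roughly 1,800 to 5,000 errors per year for a 1 GiB device. This is a fleet-wide average: since the abstract reports that only a little more than 8% of DIMMs are affected by errors per year, the mean says little about what a typical DIMM experiences.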
INDEX TERMS
Primary Classification: C. Computer Systems Organization; C.4 PERFORMANCE OF SYSTEMS
Subjects: Reliability, availability, and serviceability
General Terms: Reliability
Keywords: data corruption, dimm, dram, dram reliability, ecc, empirical study, hard error, large-scale systems, memory, soft error