Soft error vulnerability of iterative linear algebra methods
Source
International Conference on Supercomputing archive
Proceedings of the 22nd annual international conference on Supercomputing table of contents
Island of Kos, Greece
SESSION: Fault tolerance table of contents
Pages 155-164  
Year of Publication: 2008
ISBN:978-1-60558-158-3
Authors
Greg Bronevetsky  Lawrence Livermore National Laboratory, Livermore, CA, USA
Bronis de Supinski  Lawrence Livermore National Laboratory, Livermore, CA, USA
Sponsors
ACM: Association for Computing Machinery
SIGARCH: ACM Special Interest Group on Computer Architecture
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1375527.1375552

ABSTRACT

Devices are increasingly vulnerable to soft errors as their feature sizes shrink. Previously, soft error rates were significant primarily in space and high-atmospheric computing. Modern architectures now use features so small, at sufficiently low voltages, that soft errors are becoming important even at terrestrial altitudes. Due to their large number of components, supercomputers are particularly susceptible to soft errors. Since many large-scale parallel scientific applications use iterative linear algebra methods, the soft error vulnerability of these methods constitutes a large fraction of the applications' overall vulnerability. Many users consider these methods invulnerable to most soft errors since they converge from an imprecise solution to a precise one. However, we show in this paper that iterative methods are vulnerable to soft errors, exhibiting both silent data corruptions and poor ability to detect errors. Further, we evaluate a variety of soft error detection and tolerance techniques, including checkpointing, linear matrix encodings, and residual tracking techniques.
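
The abstract names residual tracking as one detection technique but does not describe how it works. The sketch below is an illustrative assumption, not the authors' implementation: it runs a plain Jacobi iteration on a small dense system, optionally flips one bit of one solution entry to mimic a soft error, and flags any iteration whose residual norm jumps sharply. The solver choice, the names `flip_bit`, `residual_norm`, and `jacobi`, and the parameters `growth_limit` and `noise_floor` are all hypothetical.

```python
import struct


def flip_bit(x, bit):
    """Return x with one bit of its IEEE-754 double representation flipped."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def residual_norm(A, x, b):
    """Infinity norm of the residual b - A x."""
    return max(
        abs(bi - sum(aij * xj for aij, xj in zip(row, x)))
        for row, bi in zip(A, b)
    )


def jacobi(A, b, iters, fault=None, growth_limit=10.0, noise_floor=1e-12):
    """Jacobi iteration with a simple residual-tracking check.

    fault: optional (iteration, index, bit) tuple; at that iteration one bit
    of x[index] is flipped after the update (a simulated soft error).
    Returns (x, flagged), where flagged lists the iterations at which the
    residual grew by more than growth_limit relative to the previous one.
    growth_limit and noise_floor are assumed tuning knobs for this sketch.
    """
    n = len(b)
    x = [0.0] * n
    prev = residual_norm(A, x, b)
    flagged = []
    for k in range(iters):
        # Standard Jacobi update: solve row i for x[i] using the old x.
        x = [
            (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)
        ]
        if fault is not None and fault[0] == k:
            x[fault[1]] = flip_bit(x[fault[1]], fault[2])
        r = residual_norm(A, x, b)
        # A converging run should not see the residual jump sharply.
        if r > growth_limit * max(prev, noise_floor):
            flagged.append(k)
        prev = r
    return x, flagged


# Diagonally dominant test system whose exact solution is [1, 2, 3].
A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
b = [6.0, 12.0, 14.0]

_, clean_flags = jacobi(A, b, iters=30)
_, exponent_flags = jacobi(A, b, iters=30, fault=(10, 2, 52))  # exponent bit
_, mantissa_flags = jacobi(A, b, iters=30, fault=(10, 2, 0))   # lowest mantissa bit
print(clean_flags, exponent_flags, mantissa_flags)
```

In this toy setting a flip in an exponent bit produces a residual jump that the check can catch, while a flip in the lowest mantissa bit passes unnoticed. A threshold of this kind trades false alarms against missed errors, so it should be read only as a starting point for experiments, not as a statement of the paper's results.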


