ABSTRACT
A transient hardware fault occurs when an energetic particle strikes a transistor, causing it to change state. Although transient faults do not permanently damage the hardware, they may corrupt computations by altering stored values and signal transfers. In this paper, we propose a new scheme for provably safe and reliable computing in the presence of transient hardware faults. In our scheme, software computations are replicated to provide redundancy, and special instructions compare the independently computed results to detect errors before critical data is written. In contrast to previous efforts in this area, we analyze our fault tolerance scheme from a formal, theoretical perspective. First, we provide an operational semantics for our assembly language, which includes a precise formal definition of our fault model. Second, we develop an assembly-level type system designed to detect reliability problems in compiled code. Third, we provide a formal specification for program fault tolerance under the given fault model and prove that all well-typed programs are indeed fault tolerant. In addition to the formal analysis, we evaluate our detection scheme and show that the reliable version of a program takes only 34% longer to execute than the unreliable version.
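The replicate-and-compare idea described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: it is written in Python rather than the paper's assembly language, and the names `checked_store`, `flip_bit`, `run`, and `FaultDetected` are hypothetical helpers invented for illustration, not instructions or definitions from the paper. A single bit flip in one copy stands in for a transient fault.

```python
class FaultDetected(Exception):
    """Raised when the two redundant copies disagree (hypothetical name)."""


def checked_store(memory, addr_a, val_a, addr_b, val_b):
    """Write to memory only if both independently computed copies agree.

    Stands in for a 'special instruction' that compares redundant
    results before critical data is written (an illustrative assumption).
    """
    if addr_a != addr_b or val_a != val_b:
        raise FaultDetected(
            f"copies disagree: ({addr_a}, {val_a}) vs ({addr_b}, {val_b})"
        )
    memory[addr_a] = val_a


def flip_bit(value, bit):
    """Model a transient fault as a single flipped bit in a stored value."""
    return value ^ (1 << bit)


def run(x, y, fault_bit=None):
    """Compute x + y twice and store the result only if the copies match.

    If fault_bit is given, that bit of the second copy is flipped before
    the comparison, simulating a fault between computation and store.
    """
    memory = {}
    addr = 0x10  # arbitrary illustrative address

    # Two independent copies of the same computation.
    sum_a = x + y
    sum_b = x + y

    # Optionally inject a single-bit fault into one copy.
    if fault_bit is not None:
        sum_b = flip_bit(sum_b, fault_bit)

    # The comparison happens before the write, so corrupt data
    # never reaches memory.
    checked_store(memory, addr, sum_a, addr, sum_b)
    return memory


if __name__ == "__main__":
    print(run(2, 3))            # fault-free run: the store succeeds
    try:
        run(2, 3, fault_bit=0)  # one copy corrupted: the store is refused
    except FaultDetected as err:
        print("fault detected:", err)
```

The sketch only detects a fault that corrupts one of the two copies; it does not model the formal fault model, the type system, or the proof of fault tolerance that the abstract describes.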