ABSTRACT
We develop an availability solution, called SafetyNet, that uses a unified, lightweight checkpoint/recovery mechanism to support multiple long-latency fault detection schemes. At an abstract level, SafetyNet logically maintains multiple, globally consistent checkpoints of the state of a shared memory multiprocessor (i.e., processors, memory, and coherence permissions), and it recovers to a pre-fault checkpoint of the system and re-executes if a fault is detected. SafetyNet efficiently coordinates checkpoints across the system in logical time and uses "logically atomic" coherence transactions to free checkpoints of transient coherence state. SafetyNet minimizes performance overhead by pipelining checkpoint validation with subsequent parallel execution.

We illustrate SafetyNet avoiding system crashes due to either dropped coherence messages or the loss of an interconnection network switch (and its buffered messages). Using full-system simulation of a 16-way multiprocessor running commercial workloads, we find that SafetyNet (a) adds statistically insignificant runtime overhead in the common case of fault-free execution, and (b) avoids a crash when tolerated faults occur.
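The checkpoint/recovery idea described above can be illustrated with a small software sketch. This is not SafetyNet's hardware design: the class `CheckpointedMemory` and its methods are hypothetical names chosen for illustration, the undo-log representation is an assumption, and the model covers only memory values on a single node (no processor registers, coherence permissions, or logical-time coordination). It shows several outstanding checkpoints being kept at once, the oldest being validated while execution continues, and a rollback to the oldest unvalidated checkpoint when a fault is detected.

```python
from collections import OrderedDict

_ABSENT = object()  # marks "address had no value before this interval"


class CheckpointedMemory:
    """Toy model (hypothetical): dict-backed memory with per-interval undo logs."""

    def __init__(self):
        self.memory = {}
        self.current = 0                   # number of the open checkpoint interval
        self.logs = OrderedDict({0: {}})   # interval -> {addr: value before interval}

    def write(self, addr, value):
        log = self.logs[self.current]
        # Log the old value only on the first write to addr in this interval.
        if addr not in log:
            log[addr] = self.memory.get(addr, _ABSENT)
        self.memory[addr] = value

    def take_checkpoint(self):
        """Close the current interval and open a new one."""
        self.current += 1
        self.logs[self.current] = {}

    def validate_oldest(self):
        """The oldest closed interval was found fault-free: discard its log."""
        if len(self.logs) > 1:             # the open interval cannot be validated
            self.logs.popitem(last=False)

    def recover(self):
        """Undo all unvalidated intervals, newest first."""
        for interval in reversed(list(self.logs)):
            for addr, old in self.logs[interval].items():
                if old is _ABSENT:
                    self.memory.pop(addr, None)
                else:
                    self.memory[addr] = old
        oldest = next(iter(self.logs))
        self.current = oldest
        self.logs = OrderedDict({oldest: {}})


m = CheckpointedMemory()
m.write("a", 1)
m.take_checkpoint()      # checkpoint 1 captures a=1
m.write("a", 2)
m.write("b", 5)
m.take_checkpoint()      # checkpoint 2 captures a=2, b=5
m.write("a", 3)
m.validate_oldest()      # interval 0 is fault-free; recovery point is checkpoint 1
m.recover()              # fault detected: roll back to checkpoint 1
print(m.memory)
```

In this sketch, validation only discards a log, so it stays off the critical path of `write`; that loosely mirrors the abstract's point about overlapping checkpoint validation with ongoing execution.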