SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery
Source: ACM SIGARCH Computer Architecture News
Volume 30, Issue 2 (May 2002)
Special Issue: Proceedings of the 29th Annual International Symposium on Computer Architecture (ISCA '02)
Session 3: Safety and reliability
Pages: 123-134
Year of Publication: 2002
ISSN: 0163-5964
Authors
Daniel J. Sorin, University of Wisconsin-Madison
Milo M. K. Martin, University of Wisconsin-Madison
Mark D. Hill, University of Wisconsin-Madison
David A. Wood, University of Wisconsin-Madison
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/545214.545229

ABSTRACT

We develop an availability solution, called SafetyNet, that uses a unified, lightweight checkpoint/recovery mechanism to support multiple long-latency fault detection schemes. At an abstract level, SafetyNet logically maintains multiple, globally consistent checkpoints of the state of a shared memory multiprocessor (i.e., processors, memory, and coherence permissions), and it recovers to a pre-fault checkpoint of the system and re-executes if a fault is detected. SafetyNet efficiently coordinates checkpoints across the system in logical time and uses "logically atomic" coherence transactions to free checkpoints of transient coherence state. SafetyNet minimizes performance overhead by pipelining checkpoint validation with subsequent parallel execution.

We illustrate SafetyNet avoiding system crashes due to either dropped coherence messages or the loss of an interconnection network switch (and its buffered messages). Using full-system simulation of a 16-way multiprocessor running commercial workloads, we find that SafetyNet (a) adds statistically insignificant runtime overhead in the common case of fault-free execution, and (b) avoids a crash when tolerated faults occur.
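The idea of keeping several outstanding checkpoints and rolling back to the oldest unvalidated one can be illustrated with a toy undo-log model. This is only a sketch under stated assumptions, not the mechanism described in the paper: it models a single memory in software, ignores processors, coherence permissions, and logical-time coordination, and the names `CheckpointedMemory`, `take_checkpoint`, `validate_oldest`, and `recover` are hypothetical.

```python
# Toy model of checkpoint/recovery with undo logs.
# Assumption: one log per checkpoint interval, storing the value each
# address held before its first write in that interval.

_ABSENT = object()  # marks an address that did not exist before the write


class CheckpointedMemory:
    def __init__(self):
        self.mem = {}
        # logs[0] belongs to the interval that starts at the recovery point
        # (the oldest checkpoint not yet validated); logs[-1] is the current one.
        self.logs = [{}]

    def write(self, addr, value):
        log = self.logs[-1]
        if addr not in log:
            # Log the old value only on the first write in this interval.
            log[addr] = self.mem.get(addr, _ABSENT)
        self.mem[addr] = value

    def take_checkpoint(self):
        # Start a new interval; earlier intervals stay recoverable.
        self.logs.append({})

    def validate_oldest(self):
        # The oldest interval is declared fault-free, so its undo log is
        # discarded and the recovery point advances by one checkpoint.
        if len(self.logs) > 1:
            self.logs.pop(0)

    def recover(self):
        # Undo intervals from newest to oldest to restore the recovery point.
        for log in reversed(self.logs):
            for addr, old in log.items():
                if old is _ABSENT:
                    self.mem.pop(addr, None)
                else:
                    self.mem[addr] = old
        self.logs = [{}]


m = CheckpointedMemory()
m.write("x", 1)
m.take_checkpoint()
m.validate_oldest()      # state {"x": 1} becomes the recovery point
m.write("x", 2)
m.take_checkpoint()      # a second, still unvalidated checkpoint
m.write("x", 3)
m.write("y", 9)
m.recover()              # fault detected: roll back past both intervals
print(m.mem)
```

In this sketch, execution continues past `take_checkpoint` without waiting for `validate_oldest`, which loosely mirrors the abstract's point that validation is pipelined with subsequent execution.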


