ABSTRACT
Trends in high-performance computing are making it necessary for long-running applications to tolerate hardware faults. The most commonly used approach is checkpoint and restart (CPR): the state of the computation is saved periodically on disk, and when a failure occurs, the computation is restarted from the last saved state. At present, it is the responsibility of the programmer to instrument applications for CPR.

Our group is investigating the use of compiler technology to instrument codes to make them self-checkpointing and self-restarting, thereby providing an automatic solution to the problem of making long-running scientific applications resilient to hardware faults. Our previous work focused on message-passing programs.

In this paper, we describe such a system for shared-memory programs running on symmetric multiprocessors. This system has two components: (i) a pre-compiler for source-to-source modification of applications, and (ii) a runtime system that implements a protocol for coordinating CPR among the threads of the parallel application. For the sake of concreteness, we focus on a non-trivial subset of OpenMP that includes barriers and locks.

One of the advantages of this approach is that the ability to tolerate faults becomes embedded within the application itself, so applications become self-checkpointing and self-restarting on any platform. We demonstrate this by showing that our transformed benchmarks can checkpoint and restart on three different platforms (Windows/x86, Linux/x86, and Tru64/Alpha). Our experiments show that the overhead introduced by this approach is usually quite small; they also suggest ways in which the current implementation can be tuned to reduce overheads further.
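To make the basic CPR idea concrete, the following is a minimal single-threaded sketch in Python, not the system described in the paper. It shows only the general pattern of saving state periodically and resuming from the last saved state; it does not model the pre-compiler or the thread-coordination protocol. All names here (`save_checkpoint`, `load_checkpoint`, `run`, and the `fail_at` fault-injection parameter) are hypothetical and chosen for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to disk atomically: write a temp file, then rename it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace swaps the file in one step, so a crash during the write
    # leaves the previous checkpoint intact.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, checkpoint_every, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing periodically.

    fail_at is a hypothetical hook that simulates a hardware fault by
    raising an exception just before the given step executes.
    """
    # Restart from the last checkpoint if one exists, else start fresh.
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated fault")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

In this sketch, work done after the most recent checkpoint is lost on failure and simply recomputed on restart, which is the trade-off that the checkpoint interval controls. A shared-memory version would additionally need all threads to agree on when a consistent state can be saved, which is the role the abstract assigns to the runtime protocol.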