ABSTRACT
The running times of many computational science applications are much longer than the mean-time-to-failure of current high-performance computing platforms. To run to completion, such applications must tolerate hardware failures. Checkpoint-and-restart (CPR) is the most commonly used scheme for accomplishing this: the state of the computation is saved periodically on stable storage, and when a hardware failure is detected, the computation is restarted from the most recently saved state. Most automatic CPR schemes in the literature can be classified as system-level checkpointing schemes because they take core-dump style snapshots of the computational state when all the processes are blocked at global barriers in the program. Unfortunately, a system that implements this style of checkpointing is tied to a particular platform; in addition, it cannot be used if there are no global barriers in the program. We are exploring an alternative called application-level, non-blocking checkpointing. In our approach, programs are transformed by a pre-processor so that they become self-checkpointing and self-restartable on any platform; there is also no assumption about the existence of global barriers in the code. In this paper, we describe our implementation of application-level, non-blocking checkpointing. We present experimental results on both a Windows cluster and a Compaq Alpha cluster, which show that the overheads introduced by our approach are small.
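The basic checkpoint-and-restart idea described above can be illustrated with a minimal single-process sketch. This is not the pre-processor or the non-blocking protocol for parallel programs that the paper describes; it only shows an application saving its own state periodically and resuming from the most recent save. The names `save_checkpoint`, `load_checkpoint`, `run`, and the `fail_at` parameter are hypothetical, introduced only for illustration, and a local file stands in for stable storage.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then atomically rename it over the old checkpoint."""
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to disk before the rename
    os.replace(tmp_path, path)  # a crash leaves either the old or the new file


def load_checkpoint(path):
    """Return the most recently saved state, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, n_steps, interval, fail_at=None):
    """Sum 0..n_steps-1, checkpointing every `interval` steps.

    `fail_at` (hypothetical) simulates a hardware failure at a given step.
    """
    # Restart from the saved state if there is one, else start fresh.
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated hardware failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Because the application decides what goes into `state`, the saved data is independent of the platform's memory layout, which is the portability argument made for application-level checkpointing. Work done after the last checkpoint is lost on failure and is simply recomputed on restart.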