Implementation and Evaluation of a Scalable Application-Level Checkpoint-Recovery Scheme for MPI Programs
Source: Conference on High Performance Networking and Computing
Proceedings of the 2004 ACM/IEEE Conference on Supercomputing
Page: 38
Year of Publication: 2004
ISBN: 0-7695-2153-3
Authors
Martin Schulz  Lawrence Livermore National Laboratory
Greg Bronevetsky  Cornell University
Rohit Fernandes  Cornell University
Daniel Marques  Cornell University
Keshav Pingali  Cornell University
Paul Stodghill  Cornell University
Sponsor
SIGARCH: ACM Special Interest Group on Computer Architecture
Publisher
IEEE Computer Society  Washington, DC, USA
DOI Bookmark: 10.1109/SC.2004.29

ABSTRACT

The running times of many computational science applications are much longer than the mean-time-to-failure of current high-performance computing platforms. To run to completion, such applications must tolerate hardware failures. Checkpoint-and-restart (CPR) is the most commonly used scheme for accomplishing this - the state of the computation is saved periodically on stable storage, and when a hardware failure is detected, the computation is restarted from the most recently saved state. Most automatic CPR schemes in the literature can be classified as system-level checkpointing schemes because they take core-dump style snapshots of the computational state when all the processes are blocked at global barriers in the program. Unfortunately, a system that implements this style of checkpointing is tied to a particular platform; in addition, it cannot be used if there are no global barriers in the program. We are exploring an alternative called application-level, non-blocking checkpointing. In our approach, programs are transformed by a pre-processor so that they become self-checkpointing and self-restartable on any platform; there is also no assumption about the existence of global barriers in the code. In this paper, we describe our implementation of application-level, non-blocking checkpointing. We present experimental results on both a Windows cluster and a Compaq Alpha cluster, which show that the overheads introduced by our approach are small.
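The basic checkpoint-and-restart cycle described in the abstract (save state periodically to stable storage, restart from the most recently saved state after a failure) can be illustrated with a minimal single-process sketch. This is not the paper's pre-processor or its non-blocking MPI protocol; the function names `save_checkpoint`, `load_checkpoint`, and `run`, the use of `pickle`, and the `fail_at` failure hook are all assumptions made for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to a temporary file, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic, so a crash never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the most recently saved state, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, n_steps, interval, fail_at=None):
    """Sum the integers 0..n_steps-1, checkpointing every `interval` steps.

    `fail_at` is a hypothetical hook that simulates a hardware failure
    just before the given step executes.
    """
    # Self-restart: resume from the saved state if one exists.
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Calling `run` again with the same checkpoint path after a failure resumes from the last saved step instead of step 0. The work done between the last checkpoint and the failure is repeated, which is the usual cost of this scheme. Coordinating such checkpoints across MPI processes without global barriers is the harder problem the paper addresses, and this sketch does not attempt it.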


REFERENCES

[7] B. Carnes. The smg2000 benchmark code. Available at http://www.llnl.gov/asci/purple/benchmarks/limited/smg/, September 19, 2001.
[9] Condor. http://www.cs.wisc.edu/condor/manual.
[11] M. Elnozahy, L. Alvisi, Y.M. Wang, and D.B. Johnson. A survey of rollback-recovery protocols in message passing systems. Technical Report CMU-CS-96-181, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA, Oct. 1996.
[15] C.-C.J. Li and W.K. Fuchs. Catch - compiler-assisted techniques for checkpointing. In 20th International Symposium on Fault Tolerant Computing, pages 74-81, 1990.
[18] J.B.M. Litzkow, T. Tannenbaum and M. Livny. Checkpoint and migration of UNIX processes in the condor distributed processing system. Technical Report 1346, University of Wisconsin-Madison, 1997.
[19] K. Perumalla and R. Fujimoto. Source-code transformations for efficient reversibility. Technical Report GIT-CC-99-21, College of Computing, Georgia Tech, September 1999.
[20] A. Petitet, R.C. Whaley, J. Dongarra, and A. Cleary. Hpl - a portable implementation of the high-performance linpack benchmark for distributed-memory computers. Available at http://www.netlib.org/benchmark/hpl/.
[24] N. Stone, J. Kochmar, R. Reddy, J.R. Scott, J. Sommerfield, and C. Vizino. A checkpoint and recovery system for the Pittsburgh Supercomputing Center Terascale Computing System. In Supercomputing, 2001. Available at http://www.psc.edu/publications/tech_reports/chkpt_rcvry/checkpoint-recovery-1.0.html
[25] S. Vadhiyar and J. Dongarra. Srs - a framework for developing malleable and migratable parallel software. Parallel Processing Letters, 13(2):291-312, June 2003.
