Automated application-level checkpointing of MPI programs
Source: Principles and Practice of Parallel Programming
Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming
San Diego, California, USA
SESSION: Checkpointing and communication
Pages: 84 - 94  
Year of Publication: 2003
ISBN:1-58113-588-2
Authors
Greg Bronevetsky  Cornell University, Ithaca, NY
Daniel Marques  Cornell University, Ithaca, NY
Keshav Pingali  Cornell University, Ithaca, NY
Paul Stodghill  Cornell University, Ithaca, NY
Sponsors
SIGPLAN: ACM Special Interest Group on Programming Languages
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 8,   Downloads (12 Months): 92,   Citation Count: 23
DOI: http://doi.acm.org/10.1145/781498.781513

ABSTRACT

The running times of many computational science applications, such as protein-folding using ab initio methods, are much longer than the mean-time-to-failure of high-performance computing platforms. To run to completion, therefore, these applications must tolerate hardware failures. In this paper, we focus on the stopping failure model in which a faulty process hangs and stops responding to the rest of the system. We argue that tolerating such faults is best done by an approach called application-level coordinated non-blocking checkpointing, and that existing fault-tolerance protocols in the literature are not suitable for implementing this approach. We then present a suitable protocol, which is implemented by a coordination layer that sits between the application program and the MPI library. We show how this protocol can be used with a precompiler that instruments C/MPI programs to save application and MPI library state. An advantage of our approach is that it is independent of the MPI implementation. We present experimental results that argue that the overhead of using our system can be small.
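
The abstract does not spell out the protocol, so the following is only an illustrative sketch of the general idea of a coordination layer interposed between an application and a message-passing library. It is not the authors' implementation. Every name in it (ToyTransport, CoordinationLayer, late_log, the epoch counter) is hypothetical, the transport is an in-memory stand-in rather than MPI, and tagging each message with the sender's checkpoint count is an assumption made for illustration.

```python
import pickle
from collections import deque


class ToyTransport:
    """Hypothetical in-memory stand-in for a message-passing library."""

    def __init__(self, nprocs):
        # One FIFO queue per (source, destination) pair.
        self.queues = {(s, d): deque()
                       for s in range(nprocs) for d in range(nprocs)}

    def send(self, src, dst, payload):
        self.queues[(src, dst)].append(payload)

    def recv(self, src, dst):
        return self.queues[(src, dst)].popleft()


class CoordinationLayer:
    """Per-process layer between the application and the transport (assumed design)."""

    def __init__(self, rank, transport):
        self.rank = rank
        self.transport = transport
        self.epoch = 0      # number of checkpoints this process has taken
        self.late_log = []  # messages sent before the sender's checkpoint,
                            # but received after ours
        self.saved = None   # serialized application state

    def send(self, dst, data):
        # Assumption: tag every message with the sender's current epoch.
        self.transport.send(self.rank, dst, (self.epoch, data))

    def recv(self, src):
        sender_epoch, data = self.transport.recv(src, self.rank)
        if sender_epoch < self.epoch:
            # The message crosses the checkpoint line; record it so that it
            # could be replayed after a rollback.
            self.late_log.append((src, data))
        return data

    def checkpoint(self, app_state):
        # Application-level: the program decides what state to save.
        # Non-blocking: no waiting for other processes at this point.
        self.saved = pickle.dumps(app_state)
        self.epoch += 1

    def restore(self):
        return pickle.loads(self.saved)
```

In this toy setup, if process 0 sends a message and process 1 checkpoints before receiving it, the layer on process 1 records the message in late_log while still delivering it to the application. Because the application only ever calls the layer's send and recv, the sketch does not depend on how the underlying transport is implemented, which loosely mirrors the independence claim in the abstract.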


