ACM Home Page
Please provide us with feedback. Feedback
Application-level checkpointing for shared memory programs
Full text PdfPdf (236 KB)
Source Architectural Support for Programming Languages and Operating Systems archive
Proceedings of the 11th international conference on Architectural support for programming languages and operating systems table of contents
Boston, MA, USA
SESSION: Reliability table of contents
Pages: 235 - 247  
Year of Publication: 2004
ISBN:1-58113-804-0
Also published in ...
Authors
Greg Bronevetsky  Cornell University, Ithaca, NY
Daniel Marques  Cornell University, Ithaca, NY
Keshav Pingali  Cornell University, Ithaca, NY
Peter Szwed  Cornell University, Ithaca, NY
Martin Schulz  University of California, Livermore, CA
Sponsors
SIGPLAN: ACM Special Interest Group on Programming Languages
SIGOPS: ACM Special Interest Group on Operating Systems
SIGARCH: ACM Special Interest Group on Computer Architecture
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 19,   Downloads (12 Months): 114,   Citation Count: 9
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1024393.1024421
What is a DOI?

ABSTRACT

Trends in high-performance computing are making it necessary for long-running applications to tolerate hardware faults. The most commonly used approach is checkpoint and restart (CPR) - the state of the computation is saved periodically on disk, and when a failure occurs, the computation is restarted from the last saved state. At present, it is the responsibility of the programmer to instrument applications for CPR.Our group is investigating the use of compiler technology to instrument codes to make them self-checkpointing and self-restarting, thereby providing an automatic solution to the problem of making long-running scientific applications resilient to hardware faults. Our previous work focused on message-passing programs.In this paper, we describe such a system for shared-memory programs running on symmetric multiprocessors. This system has two components: (i) a pre-compiler for source-to-source modification of applications, and (ii) a runtime system that implements a protocol for coordinating CPR among the threads of the parallel application. For the sake of concreteness, we focus on a non-trivial subset of OpenMP that includes barriers and locks.One of the advantages of this approach is that the ability to tolerate faults becomes embedded within the application itself, so applications become self-checkpointing and self-restarting on any platform. We demonstrate this by showing that our transformed benchmarks can checkpoint and restart on three different platforms (Windows/x86, Linux/x86, and Tru64/Alpha). Our experiments show that the overhead introduced by this approach is usually quite small; they also suggest ways in which the current implementation can be tuned to reduced overheads further.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
 
2
 
3
4
5
6
 
7
 
8
Condor.http://www.cs.wisc.edu/condor/manual.
 
9
 
10
J. Duell. The Design and Implementation of Berkeley Lab's Linux Checkp int/Restart. http://www.nersc.gov/research/FTG/checkpoint/rep rts.html.
 
11
M. Elnozahy, L. Alvisi, Y. M. Wang, and D. B. Johnson. A survey of rollback-recovery protocols in message passing systems. Technical Report CMU-CS-96--181, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA, October 1996.
 
12
P. Guedes and M. Castro. Distributed shared object memory. In Proceedings of WWOS 1993.
 
13
 
14
T. Tannenbaum J. B. M. Litzkow and M. Livny. Checkpoint and Migration of Unix Processes in the Condor Distributed Processing System. Technical Report Technical Report 1346, University of Wisconsin-Madison, 1997.
 
15
 
16
 
17
 
18
Y. M. Wang M. Elnozahy, L. Alvisi and D. B. Johnson. A survey of rollback-recovery protocols in message passing systems. Technical Report Technical Report CMU-CS-96-181, Carnegie Mellon University, October 1996.
19
 
20
K. Kusan M. Sato, S. Satoh and Y. Tanaka. Design of OpenMP compiler for an SMP cluster. In EWOMP '99 pages 32--39, September 1999.
 
21
Message Passing Interface Forum (MPIF). MPI: A message-passing interface standard. Technical Report, University of Tennessee, Knoxville, June 1995.
 
22
N. Stone, J. Kochmar, R. Reddy, J. R. Scott, J. Sommerfeld, C. Vizino. A checkpoint and recovery system for the pittsburgh supercomputing center terascale computing system. http://www.psc.edu/publications/tech_reports/chkp_rcvry/checkpoint-recovery-1.0.html.
23
 
24
OpenMP Architecture Review Board. OpenMP C and C++ Application, Program Interface Version 1.0, Document Number 004-2229-01 edition, October 1998. Available from http://www.openmp.org/.
25
 
26
 
27
 
28
29

CITED BY  9

Collaborative Colleagues:
Greg Bronevetsky: colleagues
Daniel Marques: colleagues
Keshav Pingali: colleagues
Peter Szwed: colleagues
Martin Schulz: colleagues