| Experimental evaluation of application-level checkpointing for OpenMP programs |
| Source
|
International Conference on Supercomputing
Proceedings of the 20th annual international conference on Supercomputing
Cairns, Queensland, Australia
SESSION: Checkpointing and speculation
Pages: 2 - 13
Year of Publication: 2006
ISBN:1-59593-282-8
ABSTRACT
It is becoming important for long-running scientific applications to tolerate hardware faults. The most commonly used approach is checkpoint and restart (CPR): the computation's state is saved periodically to disk, and upon failure the computation is restarted from the last saved state. The common CPR mechanism, called System-level Checkpointing (SLC), requires modifying the operating system and the communication libraries to enable them to save the state of the entire parallel application. This approach is not portable, since a checkpointer for one system rarely works on another. Application-level Checkpointing (ALC) is a portable alternative in which the programmer manually modifies the program to enable CPR, a very labor-intensive task.

We are investigating the use of compiler technology to instrument codes so that the ability to tolerate faults is embedded in the applications themselves, making them self-checkpointing and self-restarting on any platform. In [9] we described a general approach for checkpointing shared memory APIs at the application level. Since [9] applied only to a toy feature set common to most shared memory APIs, this paper shows the practicality of the approach by extending it to a specific, popular shared memory API: OpenMP. We describe the challenges involved in providing automated ALC for OpenMP applications and experimentally validate the approach with detailed performance results for our implementation of the technique. Our experiments with the NAS OpenMP benchmarks [1] and the EPCC microbenchmarks [21] show generally low overhead on three different architectures (Linux/IA64, Tru64/Alpha and Solaris/Sparc) and highlight important lessons about the performance characteristics of this approach.
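The checkpoint-and-restart idea described in the abstract can be illustrated with a minimal sketch of hand-written application-level checkpointing. It is sequential C, not OpenMP, and it is not the compiler-generated instrumentation the paper evaluates. The `State` struct, the function names `save_checkpoint`, `load_checkpoint` and `run`, and the `fail_at` parameter used to simulate a fault are all hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical application state: everything needed to resume the loop. */
typedef struct {
    long next_iter;  /* first iteration that has not been executed yet */
    double sum;      /* partial result accumulated so far */
} State;

/* Save the state to `path`. Returns 0 on success, -1 on failure.
   Assumption: dumping the raw struct is a simplification; the file is only
   readable on a machine with the same data layout. */
int save_checkpoint(const char *path, const State *s) {
    assert(path != NULL && s != NULL);
    FILE *f = fopen(path, "wb");
    if (f == NULL) return -1;
    size_t written = fwrite(s, sizeof *s, 1, f);
    if (fclose(f) != 0 || written != 1) return -1;
    return 0;
}

/* Load the state from `path`. Returns 0 on success, -1 if no usable
   checkpoint exists. */
int load_checkpoint(const char *path, State *s) {
    assert(path != NULL && s != NULL);
    FILE *f = fopen(path, "rb");
    if (f == NULL) return -1;
    size_t got = fread(s, sizeof *s, 1, f);
    fclose(f);
    return got == 1 ? 0 : -1;
}

/* Sum the integers 0..total-1, saving a checkpoint every `interval`
   iterations. On entry the function restarts from the last checkpoint if
   one exists. `fail_at` simulates a fault: the run aborts (returning -1.0)
   just before executing that iteration; pass -1 to disable it. */
double run(const char *path, long total, long interval, long fail_at) {
    State s;
    if (load_checkpoint(path, &s) != 0) {
        s.next_iter = 0;  /* no checkpoint: start from the beginning */
        s.sum = 0.0;
    }
    while (s.next_iter < total) {
        if (s.next_iter == fail_at) return -1.0;  /* in-memory state is lost */
        s.sum += (double)s.next_iter;
        s.next_iter++;
        if (s.next_iter % interval == 0) save_checkpoint(path, &s);
    }
    return s.sum;
}
```

With `total = 100` and `interval = 10`, a first call with `fail_at = 57` aborts after the checkpoint taken at iteration 50. A second call with `fail_at = -1` reloads that checkpoint, re-executes iterations 50 to 56, and completes with the same result (4950) as an uninterrupted run. The manual placement of save and restore logic shown here is the labor that the abstract says compiler instrumentation is meant to automate.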
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
|
 |
3
|
|
| |
4
|
|
| |
5
|
OpenMP Architecture Review Board. OpenMP application program interface, version 2.5.
|
 |
6
|
Greg Bronevetsky, Daniel Marques, Keshav Pingali, Paul Stodghill, Automated application-level checkpointing of MPI programs, Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming, June 11-13, 2003, San Diego, California, USA
|
 |
7
|
Greg Bronevetsky, Daniel Marques, Keshav Pingali, Paul Stodghill, Collective operations in application-level fault-tolerant MPI, Proceedings of the 17th annual international conference on Supercomputing, June 23-26, 2003, San Francisco, CA, USA
[doi> 10.1145/782814.782847]
|
| |
8
|
Greg Bronevetsky, Keshav Pingali, and Paul Stodghill. A protocol for application-level checkpointing of OpenMP programs. Technical report, Cornell Computer Science, 2005.
|
 |
9
|
Greg Bronevetsky, Daniel Marques, Keshav Pingali, Peter Szwed, Martin Schulz, Application-level checkpointing for shared memory programs, Proceedings of the 11th international conference on Architectural support for programming languages and operating systems, October 07-13, 2004, Boston, MA, USA
|
 |
10
|
|
| |
11
|
Condor. http://www.cs.wisc.edu/condor/manual.
|
| |
12
|
|
| |
13
|
J. Duell. The Design and Implementation of Berkeley Lab's Linux Checkpoint/Restart. http://www.nersc.gov/research/FTG/checkpoint/reports.html.
|
| |
14
|
M. Elnozahy, L. Alvisi, Y. M. Wang, and D. B. Johnson. A survey of rollback-recovery protocols in message passing systems. Technical Report CMU-CS-96-181, School of Computer Science, Carnegie Mellon University, October 1996.
|
 |
15
|
|
| |
16
|
M. Litzkow, T. Tannenbaum, J. Basney, and M. Livny. Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System. Technical Report 1346, University of Wisconsin-Madison, 1997.
|
| |
17
|
|
 |
18
|
|
 |
19
|
Nuno Neves, Miguel Castro, Paulo Guedes, A checkpoint protocol for an entry consistent shared memory system, Proceedings of the thirteenth annual ACM symposium on Principles of distributed computing, p.121-129, August 14-17, 1994, Los Angeles, California, United States
[doi> 10.1145/197917.197973]
|
| |
20
|
Dan Quinlan. Rose: Compiler support for object-oriented frameworks. Conference on Parallel Compilers (CPC2000), 2000.
|
| |
21
|
Fiona J. L. Reid and J. Mark Bull. OpenMP microbenchmarks version 2.0. In European Workshop on OpenMP, 2004.
|
| |
22
|
Martin Schulz, Greg Bronevetsky, Rohit Fernandes, Daniel Marques, Keshav Pingali, Paul Stodghill, Implementation and Evaluation of a Scalable Application-Level Checkpoint-Recovery Scheme for MPI Programs, Proceedings of the 2004 ACM/IEEE conference on Supercomputing, p.38, November 06-12, 2004
[doi> 10.1109/SC.2004.29]
|
 |
23
|
Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, David A. Wood, SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery, Proceedings of the 29th annual international symposium on Computer architecture, p.123, May 25-29, 2002, Anchorage, Alaska
|
| |
24
|
|
|