| Collective operations in application-level fault-tolerant MPI |
| Source
International Conference on Supercomputing
Proceedings of the 17th annual international conference on Supercomputing
San Francisco, CA, USA
SESSION: Fault tolerance
Pages: 234 - 243
Year of Publication: 2003
ISBN:1-58113-733-8
ABSTRACT
Fault-tolerance is becoming a critical issue on high-performance platforms. Checkpointing techniques make programs fault-tolerant by saving their state periodically and restoring this state after failure. System-level checkpointing saves the state of the entire machine on stable storage, but this usually has too much overhead. In practice, programmers do manual checkpointing by writing code to (i) save the values of key program variables at critical points in the program, and (ii) restore the entire computational state from these values during recovery. However, this can be difficult to do in general MPI programs without global barriers.

In an earlier paper, we presented a distributed checkpoint coordination protocol which handles MPI's point-to-point constructs, while dealing with the unique challenges of application-level checkpointing. The protocol is implemented by a thin software layer that sits between the application program and the MPI library, so it does not require any modifications to the MPI library. However, it did not handle collective communication, which is a very important part of MPI. In this paper, we extend the protocol to handle MPI's collective communication constructs. We also present experimental results that show that the overhead introduced by the protocol for collective operations is small.
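
The manual checkpointing described in steps (i) and (ii) of the abstract can be illustrated with a minimal single-process sketch. It is not the paper's protocol: it involves no MPI and no coordination between processes, which is exactly the part the paper addresses. The function names (`save_checkpoint`, `load_checkpoint`, `run`), the JSON file format, and the `fail_at` hook that simulates a crash are all assumptions made for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Step (i): write the key program variables to stable storage."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # Atomic rename, so a crash never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Step (ii): return the saved state, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def run(path, total_steps, checkpoint_every, fail_at=None):
    """Sum 0..total_steps-1, checkpointing every `checkpoint_every` steps.

    `fail_at` is a hypothetical hook that simulates a failure at that step.
    """
    # Resume from the last checkpoint if there is one, else start fresh.
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

Calling `run` again with the same path after a failure resumes from the last saved step rather than from the beginning. In an MPI program each process would additionally have to agree on a consistent point at which to save, which is the problem the coordination protocol in the abstract targets.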
CITED BY 9
Jyothish Varma , Chao Wang , Frank Mueller , Christian Engelmann , Stephen L. Scott, Scalable, fault tolerant membership for MPI tasks on HPC systems, Proceedings of the 20th annual international conference on Supercomputing, June 28-July 01, 2006, Cairns, Queensland, Australia
Martin Schulz , Greg Bronevetsky , Rohit Fernandes , Daniel Marques , Keshav Pingali , Paul Stodghill, Implementation and Evaluation of a Scalable Application-Level Checkpoint-Recovery Scheme for MPI Programs, Proceedings of the 2004 ACM/IEEE conference on Supercomputing, p.38, November 06-12, 2004
Stefan Kraemer , Lei Gao , Jan Weinstock , Rainer Leupers , Gerd Ascheid , Heinrich Meyr, HySim: a fast simulation framework for embedded software development, Proceedings of the 5th IEEE/ACM international conference on Hardware/software codesign and system synthesis, September 30-October 03, 2007, Salzburg, Austria