Collective operations in application-level fault-tolerant MPI
Source: International Conference on Supercomputing
Proceedings of the 17th Annual International Conference on Supercomputing
San Francisco, CA, USA
Session: Fault tolerance
Pages: 234-243
Year of Publication: 2003
ISBN: 1-58113-733-8
Authors
Greg Bronevetsky, Cornell University, Ithaca, NY
Daniel Marques, Cornell University, Ithaca, NY
Keshav Pingali, Cornell University, Ithaca, NY
Paul Stodghill, Cornell University, Ithaca, NY
Sponsors
ACM: Association for Computing Machinery
SIGARCH: ACM Special Interest Group on Computer Architecture
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/782814.782847
ABSTRACT

Fault tolerance is becoming a critical issue on high-performance platforms. Checkpointing techniques make programs fault-tolerant by saving their state periodically and restoring this state after failure. System-level checkpointing saves the state of the entire machine on stable storage, but this usually has too much overhead. In practice, programmers do manual checkpointing by writing code to (i) save the values of key program variables at critical points in the program, and (ii) restore the entire computational state from these values during recovery. However, this can be difficult to do in general MPI programs without global barriers.

In an earlier paper, we presented a distributed checkpoint coordination protocol which handles MPI's point-to-point constructs, while dealing with the unique challenges of application-level checkpointing. The protocol is implemented by a thin software layer that sits between the application program and the MPI library, so it does not require any modifications to the MPI library. However, it did not handle collective communication, which is a very important part of MPI. In this paper, we extend the protocol to handle MPI's collective communication constructs. We also present experimental results that show that the overhead introduced by the protocol for collective operations is small.
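The manual checkpointing pattern described in the abstract, steps (i) and (ii), can be illustrated with a minimal single-process sketch. It is not the paper's coordination protocol and involves no MPI. The function names (`save_checkpoint`, `load_checkpoint`, `run`), the JSON file format, and the `fail_at` failure-injection parameter are all assumptions introduced for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Step (i): save the key program variables (hypothetical helper).

    The state is written to a temporary file and then renamed, so a crash
    during the write cannot leave a half-written checkpoint behind.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic replacement of the old checkpoint


def load_checkpoint(path):
    """Step (ii): return the saved variables, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def run(path, n_steps, checkpoint_every=10, fail_at=None):
    """Sum the integers 0..n_steps-1, checkpointing every few iterations.

    fail_at is an assumption of this sketch: it simulates a crash by raising
    an exception when the given step is reached.
    """
    # Recovery: rebuild the computational state from the last checkpoint.
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Calling `run` again with the same path after a simulated failure resumes from the last saved step rather than from zero. In a general MPI program each process would also have to agree with the others on a consistent point at which to save; that coordination is the hard part the abstract refers to, and this sketch does not attempt it.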


