|
ABSTRACT
The running times of many computational science applications, such as protein-folding using ab initio methods, are much longer than the mean-time-to-failure of high-performance computing platforms. To run to completion, therefore, these applications must tolerate hardware failures.In this paper, we focus on the stopping failure model in which a faulty process hangs and stops responding to the rest of the system. We argue that tolerating such faults is best done by an approach called application-level coordinated non-blocking checkpointing, and that existing fault-tolerance protocols in the literature are not suitable for implementing this approach.We then present a suitable protocol, which is implemented by a co-ordination layer that sits between the application program and the MPI library. We show how this protocol can be used with a precompiler that instruments C/MPI programs to save application and MPI library state. An advantage of our approach is that it is independent of the MPI implementation. We present experimental results that argue that the overhead of using our system can be small.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
|
| |
3
|
|
 |
4
|
Greg Bronevetsky , Daniel Marques , Keshav Pingali , Paul Stodghill, Collective operations in application-level fault-tolerant MPI, Proceedings of the 17th annual international conference on Supercomputing, June 23-26, 2003, San Francisco, CA, USA
[doi> 10.1145/782814.782847]
|
 |
5
|
|
| |
6
|
|
| |
7
|
M. Elnozahy, L. Alvisi, Y. M. Wang, and D. B. Johnson. A survey of rollback-recovery protocols in message passing systems. Technical Report CMU-CS-96-181, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA, Oct. 1996.
|
| |
8
|
|
| |
9
|
M. P. I. Forum. MPI-2: Extensions to the message-passing interface, July 18 1997. Available from http://www.mpi-forum.org/docs/mpi-20-html/mpi2-report.html.
|
 |
10
|
Richard L. Graham , Sung-Eun Choi , David J. Daniel , Nehal N. Desai , Ronald G. Minnich , Craig E. Rasmussen , L. Dean Risinger , Mitchel W. Sukalski, A network-failure-tolerant message-passing system for terascale clusters, Proceedings of the 16th international conference on Supercomputing, June 22-26, 2002, New York, New York, USA
[doi> 10.1145/514191.514205]
|
 |
11
|
|
| |
12
|
IBM Research. Blue gene project overview. Online at http://www.research.ibm.com/bluegene/, 2002.
|
 |
13
|
|
| |
14
|
|
| |
15
|
J. B. M. Litzkow, T. Tannenbaum and M. Livny. Checkpoint and migration of UNIX processes in the condor distributed processing system. Technical Report 1346, University of Wisconsin-Madison, 1997.
|
| |
16
|
National Nuclear Security Administration. Asci home. Online at http://www.nnsa.doe.gov/asc/, 2002.
|
| |
17
|
|
| |
18
|
|
| |
19
|
|
| |
20
|
T. Tabe and Q. F. Stout. The use of the MPI communication library in the NAS parallel benchmarks. Technical Report CSE-TR-386-99, Advanced Computer Architecture Laboratory, Dept. of Electrical Engineering and Computer Science, University of Michigan, 17, 1999.
|
| |
21
|
NR Adiga , G Almasi , GS Almasi , Y Aridor , R Barik , D Beece , R Bellofatto , G Bhanot , R Bickford , M Blumrich , AA Bright , J Brunheroto , C Caşcaval , J Castaños , W Chan , L Ceze , P Coteus , S Chatterjee , D Chen , G Chiu , TM Cipolla , P Crumley , KM Desai , A Deutsch , T Domany , MB Dombrowa , W Donath , M Eleftheriou , C Erway , J Esch , B Fitch , J Gagliano , A Gara , R Garg , R Germain , ME Giampapa , B Gopalsamy , J Gunnels , M Gupta , F Gustavson , S Hall , RA Haring , D Heidel , P Heidelberger , LM Herger , D Hoenicke , RD Jackson , T Jamal-Eddine , GV Kopcsay , E Krevat , MP Kurhekar , AP Lanzetta , D Lieber , LK Liu , M Lu , M Mendell , A Misra , Y Moatti , L Mok , JE Moreira , BJ Nathanson , M Newton , M Ohmacht , A Oliner , V Pandit , RB Pudota , R Rand , R Regan , B Rubin , A Ruehli , S Rus , RK Sahoo , A Sanomiya , E Schenfeld , M Sharma , E Shmueli , S Singh , P Song , V Srinivasan , BD Steinmacher-Burow , K Strauss , C Surovic , R Swetz , T Takken , RB Tremaine , M Tsao , AR Umamaheshwaran , P Verma , P Vranas , TJC Ward , M Wazlowski , W Barrett , C Engel , B Drehmel , B Hilgart , D Hill , F Kasemkhani , D Krolak , CT Li , T Liebsch , J Marcella , A Muff , A Okomo , M Rouse , A Schram , M Tubbs , G Ulsh , C Wait , J Wittrup , M Bae , K Dockser , L Kissel , MK Seager , JS Vetter , K Yates, An overview of the BlueGene/L Supercomputer, Proceedings of the 2002 ACM/IEEE conference on Supercomputing, p.1-22, November 16, 2002, Baltimore, Maryland
|
CITED BY 23
|
|
Greg Bronevetsky , Daniel Marques , Keshav Pingali , Paul Stodghill, Collective operations in application-level fault-tolerant MPI, Proceedings of the 17th annual international conference on Supercomputing, June 23-26, 2003, San Francisco, CA, USA
|
|
|
|
|
|
|
|
|
Raphael Y. de Camargo , Andrei Goldchleger , Fabio Kon , Alfredo Goldman, Checkpointing-based rollback recovery for parallel applications on the InteGrade grid middleware, Proceedings of the 2nd workshop on Middleware for grid computing, p.35-40, October 18-22, 2004, Toronto, Ontario, Canada
|
|
|
Alison N. Norman , Sung-Eun Choi , Calvin Lin, Compiler-generated staggered checkpointing, Proceedings of the 7th workshop on Workshop on languages, compilers, and run-time support for scalable systems, p.1-8, October 22-23, 2004, Houston, Texas
|
|
|
|
|
|
|
|
|
|
|
|
Jyothish Varma , Chao Wang , Frank Mueller , Christian Engelmann , Stephen L. Scott, Scalable, fault tolerant membership for MPI tasks on HPC systems, Proceedings of the 20th annual international conference on Supercomputing, June 28-July 01, 2006, Cairns, Queensland, Australia
|
|
|
|
|
|
|
|
|
Camille Coti , Thomas Herault , Pierre Lemarinier , Laurence Pilard , Ala Rezmerita , Eric Rodriguez , Franck Cappello, MPI tools and performance studies---Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI, Proceedings of the 2006 ACM/IEEE conference on Supercomputing, November 11-17, 2006, Tampa, Florida
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Martin Schulz , Greg Bronevetsky , Rohit Fernandes , Daniel Marques , Keshav Pingali , Paul Stodghill, Implementation and Evaluation of a Scalable Application-Level Checkpoint-Recovery Scheme for MPI Programs, Proceedings of the 2004 ACM/IEEE conference on Supercomputing, p.38, November 06-12, 2004
|
|
|
Lei Gao , Stefan Kraemer , Rainer Leupers , Gerd Ascheid , Heinrich Meyr, A fast and generic hybrid simulation approach using C virtual machine, Proceedings of the 2007 international conference on Compilers, architecture, and synthesis for embedded systems, September 30-October 03, 2007, Salzburg, Austria
|
|
|
Panfeng Wang , Xuejun Yang , Hongyi Fu , Yunfei Du , Zhiyun Wang , Jia Jia, Automated application-level checkpointing based on live-variable analysis in MPI programs, Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming, February 20-23, 2008, Salt Lake City, UT, USA
|
|
|
Stefan Kraemer , Lei Gao , Jan Weinstock , Rainer Leupers , Gerd Ascheid , Heinrich Meyr, HySim: a fast simulation framework for embedded software development, Proceedings of the 5th IEEE/ACM international conference on Hardware/software codesign and system synthesis, September 30-October 03, 2007, Salzburg, Austria
|
|
|
|
|
|
Alejandro Duran , Roger Ferrer , Juan José Costa , Marc Gonzàz , Xavier Martorell , Eduard Ayguadé , Jesú Labarta, A proposal for error handling in OpenMP, International Journal of Parallel Programming, v.35 n.4, p.393-416, August 2007
|
|
|
|
|