ABSTRACT
Long-running High Performance Computing (HPC) applications at scale must be able to tolerate inevitable faults if they are to harness current and future HPC systems. Message Passing Interface (MPI) level transparent checkpoint/restart fault tolerance is an appealing option for HPC application developers who do not wish to restructure their code. Historically, MPI implementations that provided this option have struggled to support a full range of interconnects, especially shared memory. This paper presents a new approach for implementing checkpoint/restart coordination algorithms that allows the MPI implementation of checkpoint/restart to be interconnect agnostic. This approach allows an application to be checkpointed on one set of interconnects (e.g., InfiniBand and shared memory) and restarted with a different set of interconnects (e.g., Myrinet and shared memory, or Ethernet). By separating the network interconnect details from the checkpoint/restart coordination algorithm, we allow the HPC application to respond to changes in the cluster environment, such as interconnect unavailability due to switch failure, to rebalance its load on an existing machine, or to migrate to a different machine with a different set of interconnects. We present results characterizing the performance impact of this approach on HPC applications.
INDEX TERMS
Primary Classification:
D. Software
D.4 OPERATING SYSTEMS
D.4.5 Reliability
Subjects: Checkpoint/restart
Additional Classification:
D. Software
D.2 SOFTWARE ENGINEERING
D.2.2 Design Tools and Techniques
Subjects: Software libraries
D.4 OPERATING SYSTEMS
D.4.4 Communications Management
Subjects: Network communication; Message sending
D.4.5 Reliability
Subjects: Fault-tolerance
General Terms: Design, Experimentation, Performance, Reliability
Keywords: MPI, checkpoint coordination protocol, checkpoint/restart, fault tolerance, high speed interconnect, infiniband, myrinet, rollback-recovery, shared memory