Interconnect agnostic checkpoint/restart in Open MPI
Source
High Performance Distributed Computing
Proceedings of the 18th ACM international symposium on High performance distributed computing
Garching, Germany
SESSION: I/O and parallel computing
Pages 49-58
Year of Publication: 2009
ISBN: 978-1-60558-587-1
Authors
Joshua Hursey  Indiana University, Bloomington, IN, USA
Timothy I. Mattox  Indiana University, Bloomington, IN, USA
Andrew Lumsdaine  Indiana University, Bloomington, IN, USA
Sponsors
ACM: Association for Computing Machinery
SIGARCH: ACM Special Interest Group on Computer Architecture
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1551609.1551619

ABSTRACT

Long-running High Performance Computing (HPC) applications at scale must tolerate inevitable faults if they are to harness current and future HPC systems. Transparent checkpoint/restart fault tolerance at the Message Passing Interface (MPI) level is an appealing option for HPC application developers who do not wish to restructure their code. Historically, MPI implementations that provided this option have struggled to support a full range of interconnects, especially shared memory. This paper presents a new approach to implementing checkpoint/restart coordination algorithms that makes the MPI implementation of checkpoint/restart interconnect agnostic. This approach allows an application to be checkpointed on one set of interconnects (e.g., InfiniBand and shared memory) and restarted with a different set of interconnects (e.g., Myrinet and shared memory, or Ethernet). By separating the network interconnect details from the checkpoint/restart coordination algorithm, we allow the HPC application to respond to changes in the cluster environment, such as interconnect unavailability due to switch failure, to rebalance load on an existing machine, or to migrate to a different machine with a different set of interconnects. We present results characterizing the performance impact of this approach on HPC applications.
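
The separation described in the abstract can be illustrated with a toy model. The sketch below is not Open MPI code and does not reflect the paper's actual coordination algorithm; the `Job` class and all of its methods are hypothetical names invented for illustration. It assumes only the general idea stated above: in-flight messages are drained before the snapshot is taken, so the saved state contains nothing specific to an interconnect and the job can be restarted on a different set.

```python
import copy


class Job:
    """Toy stand-in for an MPI job. All names here are hypothetical."""

    def __init__(self, nprocs, interconnects):
        self.interconnects = list(interconnects)  # e.g. ["infiniband", "sm"]
        self.inbox = {rank: [] for rank in range(nprocs)}
        self.in_flight = []  # (dest, payload) pairs not yet delivered

    def send(self, dest, payload):
        # The message stays "on the wire" until it is delivered.
        self.in_flight.append((dest, payload))

    def drain(self):
        # Coordination step: deliver everything in flight so that no
        # state is left inside an interconnect driver.
        for dest, payload in self.in_flight:
            self.inbox[dest].append(payload)
        self.in_flight.clear()

    def checkpoint(self):
        self.drain()
        # The snapshot holds only interconnect-independent state.
        return {"inbox": copy.deepcopy(self.inbox)}

    @classmethod
    def restart(cls, snapshot, interconnects):
        # Rebuild the job on whatever interconnects are available now.
        job = cls(len(snapshot["inbox"]), interconnects)
        job.inbox = copy.deepcopy(snapshot["inbox"])
        return job


# Checkpoint on InfiniBand plus shared memory, restart on Ethernet.
job = Job(2, ["infiniband", "sm"])
job.send(1, "hello")
snapshot = job.checkpoint()
restarted = Job.restart(snapshot, ["ethernet"])
```

Because the snapshot never records which interconnects were in use, the restart step is free to choose a new set, which is the property the abstract calls interconnect agnostic.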


