Scalable, fault tolerant membership for MPI tasks on HPC systems
Source: International Conference on Supercomputing
Proceedings of the 20th Annual International Conference on Supercomputing
Cairns, Queensland, Australia
SESSION: Multicore interconnection/communication
Pages: 219-228
Year of Publication: 2006
ISBN:1-59593-282-8
Authors
Jyothish Varma, North Carolina State University, Raleigh, NC
Chao Wang, North Carolina State University, Raleigh, NC
Frank Mueller, North Carolina State University, Raleigh, NC
Christian Engelmann, Oak Ridge National Laboratory, Oak Ridge, TN
Stephen L. Scott, Oak Ridge National Laboratory, Oak Ridge, TN
Sponsors
SIGARCH: ACM Special Interest Group on Computer Architecture
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1183401.1183433

ABSTRACT

Reliability is increasingly becoming a challenge for high-performance computing (HPC) systems with thousands of nodes, such as IBM's Blue Gene/L. A shorter mean-time-to-failure can be addressed by adding fault tolerance to reconfigure working nodes to ensure that communication and computation can progress. However, existing approaches fall short in providing scalability and small reconfiguration overhead within the fault-tolerant layer. This paper contributes a scalable approach to reconfigure the communication infrastructure after node failures. We propose a decentralized (peer-to-peer) protocol that maintains a consistent view of active nodes in the presence of faults. Our protocol shows response times in the order of hundreds of microseconds and single-digit milliseconds for reconfiguration using MPI over BlueGene/L and TCP over Gigabit, respectively. The protocol can be adapted to match the network topology to further increase performance. We also verify experimental results against a performance model, which demonstrates the scalability of the approach. Hence, the membership service is suitable for deployment in the communication layer of MPI runtime systems, and we have integrated an early version into LAM/MPI.


