ABSTRACT
Reliability is increasingly becoming a challenge for high-performance computing (HPC) systems with thousands of nodes, such as IBM's Blue Gene/L. A shorter mean-time-to-failure can be addressed by adding fault tolerance to reconfigure working nodes to ensure that communication and computation can progress. However, existing approaches fall short in providing scalability and small reconfiguration overhead within the fault-tolerant layer. This paper contributes a scalable approach to reconfigure the communication infrastructure after node failures. We propose a decentralized (peer-to-peer) protocol that maintains a consistent view of active nodes in the presence of faults. Our protocol shows response times on the order of hundreds of microseconds and single-digit milliseconds for reconfiguration using MPI over Blue Gene/L and TCP over Gigabit, respectively. The protocol can be adapted to match the network topology to further increase performance. We also verify experimental results against a performance model, which demonstrates the scalability of the approach. Hence, the membership service is suitable for deployment in the communication layer of MPI runtime systems, and we have integrated an early version into LAM/MPI.