ABSTRACT
As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace. Reactive fault tolerance (FT) often does not scale due to massive I/O requirements and relies on manual job resubmission. This work complements reactive FT with proactive FT at the process level. Through health monitoring, a subset of node failures can be anticipated when a node's health deteriorates. A novel process-level live migration mechanism supports continued execution of applications during much of the process migration. This scheme is integrated into an MPI execution environment to transparently withstand health-related node failures, which eliminates the need to restart and requeue MPI jobs. Experiments indicate that 1 to 6.5 seconds of prior warning are required to successfully trigger live process migration, while similar operating system virtualization mechanisms require 13 to 24 seconds. This self-healing approach complements reactive FT by nearly cutting the number of checkpoints in half when 70% of the faults are handled proactively.
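The closing claim about checkpoint frequency can be sanity-checked with a short calculation. The sketch below assumes Young's first-order approximation for the optimal checkpoint interval, sqrt(2 * C * MTBF); the abstract does not state which model was used, so this is an illustrative assumption, and the function names and the sample numbers are hypothetical. If a fraction p of faults is avoided proactively, the mean time between the remaining faults grows by 1 / (1 - p), the interval grows by sqrt(1 / (1 - p)), and the number of checkpoints shrinks by sqrt(1 - p).

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order optimal checkpoint interval: sqrt(2 * C * MTBF).

    Assumption: this model is used here for illustration only.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def remaining_checkpoint_fraction(proactive_fraction: float) -> float:
    """Fraction of checkpoints still needed when `proactive_fraction`
    of all faults is handled proactively (hypothetical helper).

    The MTBF of the remaining faults scales by 1 / (1 - p), so the
    interval scales by sqrt(1 / (1 - p)) and the checkpoint count,
    which is inversely proportional to the interval, by sqrt(1 - p).
    """
    if not 0.0 <= proactive_fraction < 1.0:
        raise ValueError("proactive_fraction must be in [0, 1)")
    return math.sqrt(1.0 - proactive_fraction)


if __name__ == "__main__":
    # Hypothetical sample values: 50 s per checkpoint, 10,000 s MTBF.
    base = young_interval(50.0, 10_000.0)
    improved = young_interval(50.0, 10_000.0 / (1.0 - 0.7))
    print(f"baseline interval: {base:.0f} s")
    print(f"interval with 70% proactive coverage: {improved:.0f} s")
    print(f"checkpoints remaining: {remaining_checkpoint_fraction(0.7):.1%}")
```

Under this assumed model, 70% proactive coverage leaves about 55% of the original checkpoints, which is in line with "nearly cutting the number of checkpoints in half".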
CITED BY
Stephen L. Scott , Christian Engelmann , Geoffroy R. Vallée , Thomas Naughton , Anand Tikotekar , George Ostrouchov , Chokchai Leangsuksun , Nichamon Naksinehaboon , Raja Nassar , Mihaela Paun , Frank Mueller , Chao Wang , Arun B. Nagarajan , Jyothish Varma, A tunable holistic resiliency approach for high-performance computing systems, Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming, February 14-18, 2009, Raleigh, NC, USA