Proactive process-level live migration in HPC environments
Source: Conference on High Performance Networking and Computing
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Austin, Texas
SECTION: Papers
Article No. 43
Year of Publication: 2008
ISBN: 978-1-4244-2835-9
Authors
Chao Wang  North Carolina State University, Raleigh, NC
Frank Mueller  North Carolina State University, Raleigh, NC
Christian Engelmann  Oak Ridge National Laboratory, Oak Ridge, TN
Stephen L. Scott  Oak Ridge National Laboratory, Oak Ridge, TN
Publisher
IEEE Press  Piscataway, NJ, USA
DOI: http://doi.acm.org/10.1145/1413370.1413414

ABSTRACT

As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace. Reactive fault tolerance (FT) often does not scale due to massive I/O requirements and relies on manual job resubmission.

This work complements reactive FT with proactive FT at the process level. Through health monitoring, a subset of node failures can be anticipated when a node's health deteriorates. A novel process-level live migration mechanism supports continued execution of applications during much of the process migration. This scheme is integrated into an MPI execution environment to transparently sustain health-inflicted node failures, which eradicates the need to restart and requeue MPI jobs. Experiments indicate that 1-6.5 seconds of prior warning are required to successfully trigger live process migration, while similar operating system virtualization mechanisms require 13-24 seconds. This self-healing approach complements reactive FT by nearly cutting the number of checkpoints in half when 70% of the faults are handled proactively.
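
The "nearly in half" figure can be checked with a short calculation. The sketch below assumes Young's first-order approximation for the optimal checkpoint interval, sqrt(2 * C * MTBF); the abstract does not name the model, so this is an assumption, and the checkpoint cost and MTBF values are hypothetical. If a fraction p of faults is handled proactively, the failures left for the reactive scheme are rarer by a factor of 1 / (1 - p), and the number of checkpoints scales by sqrt(1 - p).

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order optimal checkpoint interval: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def checkpoint_ratio(proactive_fraction: float) -> float:
    """Checkpoints needed with proactive FT, relative to reactive FT alone.

    Assumption: faults handled proactively need no rollback, so the MTBF
    seen by the reactive scheme grows by 1 / (1 - p). The checkpoint count
    is inversely proportional to the interval, giving sqrt(1 - p).
    """
    if not 0.0 <= proactive_fraction < 1.0:
        raise ValueError("proactive_fraction must be in [0, 1)")
    return math.sqrt(1.0 - proactive_fraction)


# Hypothetical values, chosen only for illustration.
CHECKPOINT_COST_S = 60.0      # time to write one checkpoint
MTBF_S = 24 * 3600.0          # system mean time between failures
PROACTIVE_FRACTION = 0.7      # share of faults predicted and migrated away

base = young_interval(CHECKPOINT_COST_S, MTBF_S)
improved = young_interval(CHECKPOINT_COST_S, MTBF_S / (1.0 - PROACTIVE_FRACTION))
ratio = base / improved       # fewer checkpoints because intervals are longer

print(f"interval without proactive FT: {base:.0f} s")
print(f"interval with proactive FT:    {improved:.0f} s")
print(f"relative checkpoint count:     {ratio:.3f}")
```

Under these assumptions the relative checkpoint count is sqrt(0.3), roughly 0.55, which is consistent with the abstract's "nearly in half"; the ratio does not depend on the chosen cost or MTBF values.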



