A tunable holistic resiliency approach for high-performance computing systems
Source
Principles and Practice of Parallel Programming
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Raleigh, NC, USA
POSTER SESSION: Posters
Pages 305-306  
Year of Publication: 2009
ISBN:978-1-60558-397-6
Authors
Stephen L. Scott  Oak Ridge National Laboratory, Oak Ridge, TN, USA
Christian Engelmann  Oak Ridge National Laboratory, Oak Ridge, TN, USA
Geoffroy R. Vallée  Oak Ridge National Laboratory, Oak Ridge, TN, USA
Thomas Naughton  Oak Ridge National Laboratory, Oak Ridge, TN, USA
Anand Tikotekar  Oak Ridge National Laboratory, Oak Ridge, TN, USA
George Ostrouchov  Oak Ridge National Laboratory, Oak Ridge, TN, USA
Chokchai Leangsuksun  Louisiana Tech University, Ruston, LA, USA
Nichamon Naksinehaboon  Louisiana Tech University, Ruston, LA, USA
Raja Nassar  Louisiana Tech University, Ruston, LA, USA
Mihaela Paun  Louisiana Tech University, Ruston, LA, USA
Frank Mueller  North Carolina State University, Raleigh, NC, USA
Chao Wang  North Carolina State University, Raleigh, NC, USA
Arun B. Nagarajan  North Carolina State University, Raleigh, NC, USA
Jyothish Varma  North Carolina State University, Raleigh, NC, USA
Sponsors
ACM: Association for Computing Machinery
SIGPLAN: ACM Special Interest Group on Programming Languages
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1504176.1504227

ABSTRACT

To address anticipated high failure rates, resiliency has become an urgent priority for next-generation extreme-scale high-performance computing (HPC) systems. This poster describes our past and ongoing work on novel fault resilience technologies for HPC. The work presented includes proactive fault resilience techniques, system and application reliability models and analyses, failure prediction, transparent process- and virtual-machine-level migration, and trade-off models for combining preemptive migration with checkpoint/restart. The poster summarizes this work and places the individual technologies in the context of a proposed holistic fault resilience framework.



