A tunable holistic resiliency approach for high-performance computing systems
Source
Principles and Practice of Parallel Programming
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Raleigh, NC, USA
POSTER SESSION: Posters
Pages 305-306
Year of Publication: 2009
ISBN: 978-1-60558-397-6
Authors
Stephen L. Scott, Oak Ridge National Laboratory, Oak Ridge, TN, USA
Christian Engelmann, Oak Ridge National Laboratory, Oak Ridge, TN, USA
Geoffroy R. Vallée, Oak Ridge National Laboratory, Oak Ridge, TN, USA
Thomas Naughton, Oak Ridge National Laboratory, Oak Ridge, TN, USA
Anand Tikotekar, Oak Ridge National Laboratory, Oak Ridge, TN, USA
George Ostrouchov, Oak Ridge National Laboratory, Oak Ridge, TN, USA
Chokchai Leangsuksun, Louisiana Tech University, Ruston, LA, USA
Nichamon Naksinehaboon, Louisiana Tech University, Ruston, LA, USA
Raja Nassar, Louisiana Tech University, Ruston, LA, USA
Mihaela Paun, Louisiana Tech University, Ruston, LA, USA
Frank Mueller, North Carolina State University, Raleigh, NC, USA
Chao Wang, North Carolina State University, Raleigh, NC, USA
Arun B. Nagarajan, North Carolina State University, Raleigh, NC, USA
Jyothish Varma, North Carolina State University, Raleigh, NC, USA
ABSTRACT
Anticipated high failure rates have made resiliency an urgent priority for next-generation extreme-scale high-performance computing (HPC) systems. This poster describes our past and ongoing work on novel fault resilience technologies for HPC: proactive fault resilience techniques, system and application reliability models and analyses, failure prediction, transparent process-level and virtual-machine-level migration, and trade-off models for combining preemptive migration with checkpoint/restart. It summarizes this work and places the individual technologies in the context of a proposed holistic fault resilience framework.