Fault injection framework for system resilience evaluation: fake faults for finding future failures
Source
High Performance Distributed Computing archive
Proceedings of the 2009 workshop on Resiliency in high performance
Garching, Germany
Pages 23-28  
Year of Publication: 2009
ISBN:978-1-60558-593-2
Authors
Thomas Naughton  Oak Ridge National Laboratory, Oak Ridge, TN, USA
Wesley Bland  Oak Ridge National Laboratory, Oak Ridge, TN, USA
Geoffroy Vallee  Oak Ridge National Laboratory, Oak Ridge, TN, USA
Christian Engelmann  Oak Ridge National Laboratory, Oak Ridge, TN, USA
Stephen L. Scott  Oak Ridge National Laboratory, Oak Ridge, TN, USA
Sponsors
SIGARCH: ACM Special Interest Group on Computer Architecture
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1552526.1552530

ABSTRACT

As high-performance computing (HPC) systems increase in size and complexity, they become more difficult to manage. The enormous component counts associated with these large systems lead to significant challenges in system reliability and availability. This in turn is driving research into the resilience of large-scale systems, which seeks to curb the effects of increased failures at large scales by masking the inevitable faults in these systems. The basic premise is that failure must be accepted as a reality of large-scale systems and coped with accordingly through system resilience.

A key component in the development and evaluation of system resilience techniques is a means to conduct controlled experiments. A common method for performing such experiments is to generate synthetic faults and study the resulting effects. In this paper we discuss the motivation for, and our initial use of, software fault injection to support the evaluation of resilience for HPC systems. We review background and related work in the area and discuss the design of a tool to aid in fault injection experiments for both user-space (application-level) and system-level failures.


