ABSTRACT
The growing computational and storage needs of several scientific applications mandate the deployment of extreme-scale parallel machines, such as IBM's BlueGene/L, which can accommodate as many as 128K processors. One of the challenges in designing and deploying these systems in a production setting is the need to account for failures, whether in hardware or in software. Earlier work has shown that conventional runtime fault-tolerance techniques such as periodic checkpointing are not effective for these emerging systems. Instead, the ability to predict failure occurrences can help develop more effective checkpointing strategies. Failure prediction has long been regarded as a challenging research problem, mainly due to the lack of realistic failure data from actual production systems. In this study, we have collected RAS event logs from BlueGene/L over a period of more than 100 days. We have investigated the characteristics of fatal failure events, as well as the correlation between fatal events and non-fatal events. Based on these observations, we have developed three simple yet effective failure prediction methods, which can predict around 80% of the memory and network failures and 47% of the application I/O failures.
CITED BY 6
Jim Brandt, Ann Gentile, Jackson Mayo, Philippe Pébay, Diana Roe, David Thompson, Matthew Wong, Methodologies for advance warning of compute cluster problems via statistical analysis: a case study, Proceedings of the 2009 workshop on Resiliency in high performance, p.7-14, June 09-09, 2009, Garching, Germany