BlueGene/L Failure Analysis and Prediction Models
Source DSN archive
Proceedings of the International Conference on Dependable Systems and Networks
Pages: 425 - 434  
Year of Publication: 2006
ISBN: 0-7695-2607-1
Authors
Yinglung Liang  Rutgers University
Yanyong Zhang  Rutgers University
Anand Sivasubramaniam  Penn State University
Morris Jette  Lawrence Livermore National Laboratory
Ramendra Sahoo  IBM T. J. Watson Research Center
Publisher
IEEE Computer Society  Washington, DC, USA
DOI Bookmark: 10.1109/DSN.2006.18

ABSTRACT

The growing computational and storage needs of several scientific applications mandate the deployment of extreme-scale parallel machines, such as IBM's BlueGene/L, which can accommodate as many as 128K processors. One of the challenges when designing and deploying these systems in a production setting is the need to take failure occurrences, whether in the hardware or in the software, into account. Earlier work has shown that conventional runtime fault-tolerant techniques such as periodic checkpointing are not effective for the emerging systems. Instead, the ability to predict failure occurrences can help develop more effective checkpointing strategies. Failure prediction has long been regarded as a challenging research problem, mainly due to the lack of realistic failure data from actual production systems. In this study, we have collected RAS event logs from BlueGene/L over a period of more than 100 days. We have investigated the characteristics of fatal failure events, as well as the correlation between fatal events and non-fatal events. Based on the observations, we have developed three simple yet effective failure prediction methods, which can predict around 80% of the memory and network failures, and 47% of the application I/O failures.


