Cooperative checkpointing: a robust approach to large-scale systems reliability
Source: International Conference on Supercomputing
Proceedings of the 20th Annual International Conference on Supercomputing
Cairns, Queensland, Australia
Session: Checkpointing and speculation
Pages: 14-23
Year of Publication: 2006
ISBN: 1-59593-282-8
Authors
Adam J. Oliner  Stanford University, Palo Alto, CA
Larry Rudolph  MIT, CSAIL, Cambridge, MA
Ramendra K. Sahoo  IBM, T. J. Watson Research Center, Hawthorne, NY
Sponsors
SIGARCH: ACM Special Interest Group on Computer Architecture
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1183401.1183406

ABSTRACT

Cooperative checkpointing increases the performance and robustness of a system by allowing checkpoints requested by applications to be dynamically skipped at runtime. A robust system must be more than merely resilient to failures; it must be adaptable and flexible in the face of new and evolving challenges. A simulation-based experimental analysis using both probabilistic and harvested failure distributions reveals that cooperative checkpointing enables an application to make progress under a wide variety of failure distributions that periodic checkpointing lacks the flexibility to handle. Cooperative checkpointing can be easily implemented on top of existing application-initiated checkpointing mechanisms and may be used to enhance other reliability techniques like QoS guarantees and fault-aware job scheduling. The simulations also support a number of theoretical predictions related to cooperative checkpointing, including the non-competitiveness of periodic checkpointing.
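
The core idea in the abstract, that the application requests checkpoints and the system decides at runtime whether to perform or skip each one, can be illustrated with a small sketch. This is not the authors' policy or implementation: the `Gatekeeper` class, its fields, and the expected-loss rule are hypothetical, chosen only to show the grant-or-skip mechanism. The sketch assumes the checkpoint cost and a failure probability per request interval are known in advance.

```python
from dataclasses import dataclass


@dataclass
class Gatekeeper:
    """Hypothetical runtime policy that grants or skips requested checkpoints."""
    checkpoint_cost: float      # seconds to write one checkpoint (assumed known)
    failure_probability: float  # assumed chance of a failure before the next request
    unsaved_work: float = 0.0   # seconds of work done since the last saved checkpoint

    def record_work(self, seconds: float) -> None:
        """Account for work the application completed since its last request."""
        self.unsaved_work += seconds

    def request_checkpoint(self) -> bool:
        """Return True if the checkpoint is performed, False if it is skipped.

        Illustrative rule (an assumption, not taken from the paper): perform
        the checkpoint only when the expected work lost to a failure is at
        least as large as the cost of writing the checkpoint.
        """
        expected_loss = self.failure_probability * self.unsaved_work
        if expected_loss >= self.checkpoint_cost:
            self.unsaved_work = 0.0  # work is now safely saved
            return True
        return False


def run(intervals, gatekeeper):
    """Simulate an application that requests a checkpoint after each interval."""
    decisions = []
    for seconds in intervals:
        gatekeeper.record_work(seconds)
        decisions.append(gatekeeper.request_checkpoint())
    return decisions


if __name__ == "__main__":
    g = Gatekeeper(checkpoint_cost=100.0, failure_probability=0.25)
    print(run([300.0, 300.0, 300.0], g))
```

With these assumed numbers, the first request is skipped (expected loss 75 s is below the 100 s cost), the second is performed (expected loss 150 s), and the third is skipped again. A periodic scheme would have written all three checkpoints regardless of risk.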



