Cooperative checkpointing: a robust approach to large-scale systems reliability
Source
International Conference on Supercomputing
Proceedings of the 20th annual international conference on Supercomputing
Cairns, Queensland, Australia
SESSION: Checkpointing and speculation
Pages: 14 - 23
Year of Publication: 2006
ISBN: 1-59593-282-8
ABSTRACT
Cooperative checkpointing increases the performance and robustness of a system by allowing checkpoints requested by applications to be dynamically skipped at runtime. A robust system must be more than merely resilient to failures; it must be adaptable and flexible in the face of new and evolving challenges. A simulation-based experimental analysis using both probabilistic and harvested failure distributions reveals that cooperative checkpointing enables an application to make progress under a wide variety of failure distributions that periodic checkpointing lacks the flexibility to handle. Cooperative checkpointing can be easily implemented on top of existing application-initiated checkpointing mechanisms and may be used to enhance other reliability techniques like QoS guarantees and fault-aware job scheduling. The simulations also support a number of theoretical predictions related to cooperative checkpointing, including the non-competitiveness of periodic checkpointing.
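The core mechanism described in the abstract, where the application requests checkpoints and the system may skip them at runtime, can be sketched as follows. This is a minimal illustration only: the `Gatekeeper` class, its fields, and the expected-loss rule are hypothetical and are not taken from the paper, which does not specify a policy in this abstract.

```python
from dataclasses import dataclass


@dataclass
class Gatekeeper:
    """Hypothetical runtime component that grants or skips checkpoint requests."""

    checkpoint_cost: float     # seconds to write one checkpoint (assumed known)
    failure_rate: float        # assumed failures per second, e.g. from a predictor
    unsaved_work: float = 0.0  # seconds of computation not yet covered by a checkpoint
    performed: int = 0
    skipped: int = 0

    def request_checkpoint(self, work_since_last_request: float) -> bool:
        """Handle one application-initiated checkpoint request.

        Returns True if the checkpoint is performed, False if it is skipped.
        """
        self.unsaved_work += work_since_last_request

        # Illustrative heuristic (an assumption, not the paper's policy):
        # approximate the chance of losing the unsaved work, capped at 1,
        # and compare the expected loss against the cost of checkpointing.
        failure_probability = min(1.0, self.failure_rate * self.unsaved_work)
        expected_loss = failure_probability * self.unsaved_work

        if expected_loss >= self.checkpoint_cost:
            self.performed += 1
            self.unsaved_work = 0.0  # the work is now safely stored
            return True

        self.skipped += 1  # the work stays at risk until a later checkpoint
        return False


if __name__ == "__main__":
    gatekeeper = Gatekeeper(checkpoint_cost=60.0, failure_rate=1e-4)
    # The application requests a checkpoint after every 600 s of computation.
    for step in range(4):
        granted = gatekeeper.request_checkpoint(600.0)
        print(f"request {step}: {'performed' if granted else 'skipped'}")
```

Under these assumed numbers, every other request is skipped. A periodic scheme would perform all four regardless of the estimated failure rate, which is the inflexibility the abstract contrasts against.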