Compiler-generated staggered checkpointing
Source: ACM International Conference Proceeding Series; Vol. 81
Proceedings of the 7th workshop on languages, compilers, and run-time support for scalable systems
Houston, Texas
Pages: 1 - 8  
Year of Publication: 2004
Authors
Alison N. Norman  The University of Texas at Austin
Sung-Eun Choi  Los Alamos National Laboratory
Calvin Lin  The University of Texas at Austin
Sponsors
University of Houston
The Texas Learning & Computation Center
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1066650.1066663

ABSTRACT

To minimize work lost due to system failures, large parallel applications perform periodic checkpoints. These checkpoints are typically inserted manually by application programmers, resulting in synchronous checkpoints, that is, checkpoints that occur at the same program point in all processes. While this solution is tenable for current systems, it will become problematic for future supercomputers that have many tens of thousands of nodes, because contention for both the network and the file system will grow. This paper shows that staggered checkpoints (globally consistent checkpoints in which processes perform checkpoints at different points in the code) can significantly reduce network and file system contention. We describe a compiler-based approach for inserting staggered checkpoints, and we show, using trace-driven simulation, that staggered checkpointing is 23 times faster than synchronous checkpointing.
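
The contention argument in the abstract can be illustrated with a toy bandwidth-sharing model. The Python sketch below is not the paper's trace-driven simulator and does not reproduce its results. It assumes, purely for illustration, that the shared file system's bandwidth is divided evenly among all processes writing at the same moment and that no process can exceed its own link bandwidth. The function name `checkpoint_write_seconds` and every machine parameter are hypothetical.

```python
def checkpoint_write_seconds(concurrent_writers, checkpoint_bytes,
                             fs_bandwidth, link_bandwidth):
    """Seconds one process needs to write its checkpoint.

    Toy assumption (not from the paper): the shared file system's
    bandwidth is split evenly among all processes writing at once,
    and no process can exceed its own link bandwidth.
    """
    if concurrent_writers < 1:
        raise ValueError("concurrent_writers must be at least 1")
    per_writer = min(link_bandwidth, fs_bandwidth / concurrent_writers)
    return checkpoint_bytes / per_writer


# Hypothetical machine: every number below is illustrative only.
NUM_PROCESSES = 1024      # all write together in the synchronous case
GROUP_SIZE = 16           # processes writing together when staggered
CHECKPOINT_BYTES = 1e9    # checkpoint size per process
FS_BANDWIDTH = 1e10       # aggregate file system bandwidth, bytes/second
LINK_BANDWIDTH = 1e9      # per-process network link, bytes/second

synchronous = checkpoint_write_seconds(
    NUM_PROCESSES, CHECKPOINT_BYTES, FS_BANDWIDTH, LINK_BANDWIDTH)
staggered = checkpoint_write_seconds(
    GROUP_SIZE, CHECKPOINT_BYTES, FS_BANDWIDTH, LINK_BANDWIDTH)

print(f"synchronous: {synchronous:.1f} s per process")
print(f"staggered:   {staggered:.1f} s per process")
```

In this model, staggering does not reduce the total number of bytes written. It only shortens the time each process spends blocked on its own checkpoint, because fewer writers compete for the file system at once. The sketch also ignores the requirement, stated in the abstract, that the staggered checkpoints together form a globally consistent state.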


