CLIP: a checkpointing tool for message-passing parallel programs
Source: Conference on High Performance Networking and Computing
Proceedings of the 1997 ACM/IEEE conference on Supercomputing (CDROM)
San Jose, CA
Pages: 1 - 11  
Year of Publication: 1997
ISBN: 0-89791-985-8
Authors
Yuqun Chen  Princeton University, Princeton NJ
James S. Plank  University of Tennessee, Knoxville TN
Kai Li  Princeton University, Princeton NJ
Sponsors
IEEE-CS\DATC : IEEE Computer Society
SIGARCH: ACM Special Interest Group on Computer Architecture
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/509593.509626

ABSTRACT

Checkpointing is a useful technique for rollback recovery of parallel applications. While extensive research has been performed on checkpointing in parallel environments, there are few checkpointers available to application users on commercial parallel computers. This paper presents one such checkpointer: CLIP. CLIP is a user-level library that provides semi-transparent checkpointing for parallel programs on the Intel Paragon multicomputer. It is publicly available to Paragon users at no cost.

Conceptually, checkpointing a multicomputer is quite straightforward. However, when creating an actual tool for checkpointing a complex machine like the Paragon, many more issues arise that require careful design decisions to be made. Sometimes ease-of-use must be sacrificed for efficiency and/or correctness. This paper details what these decisions are, and how they were made in CLIP.

We also present performance data when checkpointing several long-running Paragon applications with CLIP. The bottom line is that a convenient, general-purpose checkpointing tool like CLIP can provide fault-tolerance on a massively parallel multicomputer like the Paragon with very good performance.


