ABSTRACT
Checkpointing is a useful technique for rollback recovery of parallel applications. While extensive research has been performed on checkpointing in parallel environments, there are few checkpointers available to application users on commercial parallel computers. This paper presents one such checkpointer: CLIP. CLIP is a user-level library that provides semi-transparent checkpointing for parallel programs on the Intel Paragon multicomputer. It is publicly available to Paragon users at no cost.

Conceptually, checkpointing a multicomputer is quite straightforward. However, when creating an actual tool for checkpointing a complex machine like the Paragon, many more issues arise that require careful design decisions to be made. Sometimes ease-of-use must be sacrificed for efficiency and/or correctness. This paper details what these decisions are, and how they were made in CLIP.

We also present performance data when checkpointing several long-running Paragon applications with CLIP. The bottom line is that a convenient, general-purpose checkpointing tool like CLIP can provide fault-tolerance on a massively parallel multicomputer like the Paragon with very good performance.