Adaptive incremental checkpointing for massively parallel systems
Source
International Conference on Supercomputing
Proceedings of the 18th Annual International Conference on Supercomputing
Saint-Malo, France
SESSION: Middleware for high performance computing
Pages: 277-286
Year of Publication: 2004
ISBN: 1-58113-839-3
Authors
Saurabh Agarwal, IBM India Research Labs, New Delhi, India
Rahul Garg, IBM India Research Labs, New Delhi, India
Meeta S. Gupta, IBM India Research Labs, New Delhi, India
Jose E. Moreira, IBM T.J. Watson Research Center, Yorktown Heights, NY
Sponsors
SIGARCH: ACM Special Interest Group on Computer Architecture
ACM: Association for Computing Machinery
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1006209.1006248

ABSTRACT

Given the scale of massively parallel systems, the occurrence of faults is no longer an exception but a regular event, and periodic checkpointing is becoming increasingly important in these systems. However, the huge memory footprints of parallel applications severely limit the scalability of conventional checkpointing techniques. Incremental checkpointing is a well-researched technique that addresses these scalability concerns, but most implementations require paging support from the hardware and the underlying operating system, which may not always be available. In this paper, we propose a software-based adaptive incremental checkpointing technique that uses a secure hash function to uniquely identify changed blocks in memory. Our algorithm is the first self-optimizing algorithm that dynamically computes the optimal block boundaries based on the history of changed blocks, which provides better opportunities for minimizing checkpoint file size. Since the hash is computed in software, the technique needs no system support. We have implemented and tested this mechanism on the BlueGene/L system. Our results on several well-known benchmarks are encouraging, both in the reduction of average checkpoint file size and in adaptivity to the application's memory access patterns.
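
The hash-based change detection described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the fixed block size, and the choice of SHA-256 as the secure hash are assumptions made for illustration, and the adaptive recomputation of block boundaries is not shown because the abstract does not specify it.

```python
import hashlib


def block_digests(memory, block_size):
    """Hash each fixed-size block of the memory image (SHA-256 is an assumed choice)."""
    return [
        hashlib.sha256(memory[i:i + block_size]).digest()
        for i in range(0, len(memory), block_size)
    ]


def incremental_checkpoint(memory, prev_digests, block_size):
    """Return (changed_blocks, new_digests).

    A block is saved only if its digest differs from the digest recorded
    at the previous checkpoint, or if no previous digest exists for it.
    """
    digests = block_digests(memory, block_size)
    changed = {}
    for idx, digest in enumerate(digests):
        if idx >= len(prev_digests) or prev_digests[idx] != digest:
            start = idx * block_size
            changed[idx] = bytes(memory[start:start + block_size])
    return changed, digests


def restore(base_image, increments, block_size):
    """Rebuild a memory image by replaying increments, oldest first, over a full image."""
    image = bytearray(base_image)
    for changed in increments:
        for idx, data in changed.items():
            start = idx * block_size
            image[start:start + len(data)] = data
    return bytes(image)


if __name__ == "__main__":
    BLOCK = 16  # hypothetical block size in bytes
    mem = bytearray(64)

    # First checkpoint: no history, so every block is written.
    full, digests = incremental_checkpoint(mem, [], BLOCK)
    base = bytes(mem)

    # The application touches one byte; only its block is written next time.
    mem[20] = 0xFF
    delta, digests = incremental_checkpoint(mem, digests, BLOCK)
    print(len(full), sorted(delta))
    assert restore(base, [delta], BLOCK) == bytes(mem)
```

Because the comparison uses only digests computed in user space, the sketch needs no page-protection or dirty-bit support from the operating system, which is the property the abstract emphasizes. The cost is that every block is hashed at each checkpoint.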



