ABSTRACT
The scalability of future massively parallel processing (MPP) systems is challenged by high failure rates. Current hard disk drive (HDD) checkpointing incurs an overhead of 25% or more at the petascale. Because the required checkpoint frequency grows with node count, novel techniques that can take more frequent checkpoints with minimal overhead are critical to building a reliable exascale system. In this work, we leverage the upcoming Phase-Change Random Access Memory (PCRAM) technology and, after a thorough analysis of MPP system failure rates and failure sources, propose a hybrid local/global checkpointing mechanism. We propose three variants of PCRAM-based hybrid checkpointing schemes, DIMM+HDD, DIMM+DIMM, and 3D+3D, that reduce the checkpoint overhead and offer a smooth transition from conventional pure-HDD checkpointing to the ideal 3D PCRAM mechanism. The proposed pure 3D PCRAM-based mechanism can ultimately take checkpoints with an overhead of less than 4% on a projected exascale system.