Leveraging 3D PCRAM technologies to reduce checkpoint overhead for future exascale systems
Source: Conference on High Performance Networking and Computing
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Portland, Oregon
Session: Technical papers
Article No.: 57  
Year of Publication: 2009
ISBN:978-1-60558-744-8
Authors
Xiangyu Dong  Pennsylvania State University
Naveen Muralimanohar  Hewlett-Packard Labs
Norm Jouppi  Hewlett-Packard Labs
Richard Kaufmann  Hewlett-Packard Labs
Yuan Xie  Pennsylvania State University
Sponsors
SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE CS
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1654059.1654117

ABSTRACT

The scalability of future massively parallel processing (MPP) systems is challenged by high failure rates. Current hard disk drive (HDD) checkpointing incurs an overhead of 25% or more at the petascale. Because checkpoint frequency correlates directly with node count, novel techniques that can take more frequent checkpoints with minimal overhead are critical to implementing a reliable exascale system. In this work, we leverage the upcoming Phase-Change Random Access Memory (PCRAM) technology and propose a hybrid local/global checkpointing mechanism, based on a thorough analysis of MPP system failure rates and failure sources.
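
The link between node count and checkpoint overhead can be illustrated with a small sketch. It does not reproduce the paper's own model; it assumes Young's classic first-order approximation, in which system MTBF shrinks as node MTBF divided by node count and the optimal checkpoint interval is sqrt(2 * checkpoint cost * MTBF). The function names and the sample numbers are hypothetical and chosen only for illustration.

```python
import math


def system_mtbf(node_mtbf_hours, node_count):
    """Assumed model: independent node failures, so system MTBF scales as 1/N."""
    return node_mtbf_hours / node_count


def young_interval(checkpoint_cost_hours, mtbf_hours):
    """Young's first-order optimal compute time between checkpoints."""
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)


def overhead_fraction(checkpoint_cost_hours, mtbf_hours):
    """Approximate wasted-time fraction at the optimal interval.

    First term: time spent writing checkpoints.
    Second term: expected rework after a failure (half an interval on average).
    Only meaningful when the checkpoint cost is much smaller than the MTBF.
    """
    tau = young_interval(checkpoint_cost_hours, mtbf_hours)
    return checkpoint_cost_hours / tau + tau / (2.0 * mtbf_hours)


# Illustrative values only (not taken from the paper).
small = overhead_fraction(0.5, system_mtbf(10_000, 100))
large = overhead_fraction(0.5, system_mtbf(10_000, 1_000))
```

Under these assumptions, a tenfold increase in node count shortens the optimal interval and raises the overhead fraction, which is why a lower per-checkpoint cost matters more as systems grow.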

We propose three variants of PCRAM-based hybrid checkpointing schemes, DIMM+HDD, DIMM+DIMM, and 3D+3D, to reduce the checkpoint overhead and offer a smooth transition from the conventional pure HDD checkpoint to the ideal 3D PCRAM mechanism. The proposed pure 3D PCRAM-based mechanism can ultimately take checkpoints with an overhead of less than 4% on a projected exascale system.

