Using queue structures to improve job reliability
Source
Proceedings of the 16th International Symposium on High Performance Distributed Computing
Monterey, California, USA
SESSION: Reliability and fault tolerance
Pages: 43-54
Year of Publication: 2007
ISBN: 978-1-59593-673-8
Authors
Thomas J. Hacker  Purdue University
Zdzislaw Meglicki  Indiana University
Sponsors
ACM: Association for Computing Machinery
SIGARCH: ACM Special Interest Group on Computer Architecture
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1272366.1272373

ABSTRACT

Many high performance computing systems today exploit the availability and remarkable performance characteristics of stand-alone server systems and the impressive price/performance ratio of commodity components. Small-scale HPC systems, in the range from 16 to 64 processors, have enjoyed significant popularity and are an indispensable tool for the research community. Scaling up to hundreds and thousands of processors, however, has exposed operational issues, which include system availability and reliability. In this paper, we explore the impact of individual component reliability rates on the overall reliability of an HPC system. We derive a mathematical model for determining the failure rate of the system and the probability of failure of a job running on a subset of the system, and we show how to design a reasonable queue structure that provides a reliable system over a broad job mix. We also explore the impact of reliability and queue structure on checkpoint intervals and recovery. Our results demonstrate that it is possible to design a reliable high performance computing system with very good operational reliability characteristics from a collection of moderately reliable components.
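
The abstract does not state the paper's model, so the following is only an illustrative sketch of the kind of calculation it describes, not the authors' method. It assumes independent nodes with a constant (exponential) failure rate, which may differ from the distribution used in the paper, and it uses Young's first-order approximation for the checkpoint interval. All function names, queue names, and numeric values are hypothetical.

```python
import math


def system_failure_rate(node_failure_rate: float, n_nodes: int) -> float:
    """Failure rate of n independent nodes in series (assumed exponential model)."""
    return node_failure_rate * n_nodes


def job_failure_probability(node_failure_rate: float, n_nodes: int,
                            runtime_hours: float) -> float:
    """Probability that at least one of the job's nodes fails during the run."""
    rate = system_failure_rate(node_failure_rate, n_nodes)
    return 1.0 - math.exp(-rate * runtime_hours)


def max_runtime_for_target(node_failure_rate: float, n_nodes: int,
                           max_failure_prob: float) -> float:
    """Longest wall-clock limit that keeps job failure probability under a target."""
    rate = system_failure_rate(node_failure_rate, n_nodes)
    return -math.log(1.0 - max_failure_prob) / rate


def young_checkpoint_interval(checkpoint_cost_hours: float,
                              job_mtbf_hours: float) -> float:
    """Young's approximation: interval = sqrt(2 * checkpoint cost * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_hours * job_mtbf_hours)


if __name__ == "__main__":
    node_rate = 1e-4      # hypothetical failures per node-hour
    target = 0.05         # hypothetical acceptable job failure probability
    # Hypothetical queues, each defined by its maximum node count.
    for name, nodes in [("small", 16), ("medium", 128), ("large", 1024)]:
        limit = max_runtime_for_target(node_rate, nodes, target)
        mtbf = 1.0 / system_failure_rate(node_rate, nodes)
        interval = young_checkpoint_interval(0.1, mtbf)
        print(f"{name}: {nodes} nodes, wall-clock limit {limit:.1f} h, "
              f"checkpoint every {interval:.2f} h")
```

Under these assumptions, a queue that allows more nodes per job needs a shorter wall-clock limit to hold the same job failure probability, which is one way a queue structure can trade job size against job duration.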



