ABSTRACT
MOLAR is a multi-institutional research effort that concentrates on adaptive, reliable, and efficient operating and runtime system (OS/R) solutions for ultra-scale high-end scientific computing on the next generation of supercomputers. This research addresses the challenges outlined in FAST-OS (forum to address scalable technology for runtime and operating systems) and HECRTF (high-end computing revitalization task force) activities by exploring the use of advanced monitoring and adaptation to improve application performance and predictability of system interruptions, and by advancing computer reliability, availability and serviceability (RAS) management systems to work cooperatively with the OS/R to identify and preemptively resolve system issues. This paper describes recent research of the MOLAR team in advancing RAS for high-end computing OS/Rs.
INDEX TERMS
Primary Classification:
C. Computer Systems Organization
C.4 PERFORMANCE OF SYSTEMS
Subjects: Reliability, availability, and serviceability
Additional Classification:
D. Software
D.3 PROGRAMMING LANGUAGES
D.3.4 Processors
Subjects: Run-time environments
D.4 OPERATING SYSTEMS
D.4.5 Reliability
Subjects: Fault-tolerance
D.4.8 Performance
Subjects: Monitors
General Terms: Design, Experimentation, Performance, Reliability
Keywords: RAS, availability, fault tolerance, group membership, high-end computing, monitoring, reliability