MOLAR: adaptive runtime support for high-end computing operating and runtime systems
Full text: PDF (522 KB)
Source: ACM SIGOPS Operating Systems Review
Volume 40, Issue 2 (April 2006)
COLUMN: Operating and runtime systems for high-end computing systems
Pages: 63-72
Year of Publication: 2006
ISSN: 0163-5980
Authors
Christian Engelmann  Oak Ridge National Laboratory, Oak Ridge, TN
Stephen L. Scott  Oak Ridge National Laboratory, Oak Ridge, TN
David E. Bernholdt  Oak Ridge National Laboratory, Oak Ridge, TN
Narasimha R. Gottumukkala  Louisiana Tech University, Ruston, LA
Chokchai Leangsuksun  Louisiana Tech University, Ruston, LA
Jyothish Varma  North Carolina State University, Raleigh, NC
Chao Wang  North Carolina State University, Raleigh, NC
Frank Mueller  North Carolina State University, Raleigh, NC
Aniruddha G. Shet  The Ohio State University, Columbus, OH
P. Sadayappan  The Ohio State University, Columbus, OH
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1131322.1131337

ABSTRACT

MOLAR is a multi-institutional research effort that concentrates on adaptive, reliable, and efficient operating and runtime system (OS/R) solutions for ultra-scale high-end scientific computing on the next generation of supercomputers. This research addresses the challenges outlined in FAST-OS (Forum to Address Scalable Technology for Runtime and Operating Systems) and HECRTF (High-End Computing Revitalization Task Force) activities by exploring the use of advanced monitoring and adaptation to improve application performance and predictability of system interruptions, and by advancing computer reliability, availability, and serviceability (RAS) management systems to work cooperatively with the OS/R to identify and preemptively resolve system issues. This paper describes recent research of the MOLAR team in advancing RAS for high-end computing OS/Rs.


REFERENCES

1. G. Bosilca, Z. Chen, J. J. Dongarra, and J. Langou. Recovery patterns for iterative methods in a parallel unstable environment. Submitted to SIAM Journal on Scientific Computing, 2005.
2. Z. Chen, G. E. Fagg, E. Gabriel, J. Langou, T. Angskun, G. Bosilca, and J. J. Dongarra. Building fault survivable MPI programs with FT-MPI using diskless checkpointing. Proceedings of the Symposium on Principles and Practice of Parallel Programming (PPoPP), 2005.
6. C. Engelmann and G. A. Geist. A lightweight kernel for the Harness metacomputing framework. Proceedings of 14th Heterogeneous Computing Workshop (HCW), pages 120-126, 2005.
7. C. Engelmann and G. A. Geist. Super-scalable algorithms for computing on 100,000 processors. Lecture Notes in Computer Science: Proceedings of International Conference on Computational Science (ICCS), 3514:313-320, 2005.
8. C. Engelmann and G. A. Geist. RMIX: A dynamic, heterogeneous, reconfigurable communication framework. Lecture Notes in Computer Science: Proceedings of International Conference on Computational Science (ICCS), 2006.
9. C. Engelmann and S. L. Scott. Concepts for high availability in scientific high-end computing. Proceedings of High Availability and Performance Computing Workshop (HAPCW), 2005.
11. C. Engelmann, S. L. Scott, and G. A. Geist. High availability through distributed control. Proceedings of High Availability and Performance Computing Workshop (HAPCW), 2004.
12. C. Engelmann, S. L. Scott, C. Leangsuksun, and X. He. Active/active replication for highly available hpc system services. Proceedings of International Symposium on Frontiers in Availability, Reliability and Security (FARES), 2006.
14. Forum to Address Scalable Technology for Runtime and Operating Systems. FAST-OS at http://www.fastos.org.
15. Forum to Address Scalable Technology for Runtime and Operating Systems (FAST-OS). MOLAR project at http://www.fastos.org/molar.
16. Fault Tolerant MPI (FT-MPI) Project at University of Tennessee, Knoxville, TN, USA. At http://icl.cs.utk.edu/ftmpi.
17. E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham, and T. S. Woodall. Open MPI: Goals, concept, and design of a next generation MPI implementation. Proceedings of 11th European PVM/MPI Users' Group Meeting, 2004.
19. G. A. Geist, J. A. Kohl, S. L. Scott, and P. M. Papadopoulos. HARNESS: Adaptable virtual machine environment for heterogeneous clusters. Parallel Processing Letters, 9(2):253-273, 1999.
20. N. R. Gottumukkala, C. Leangsuksun, and S. L. Scott. Reliability-aware approach to improve job completion time for large-scale parallel applications. Proceedings of 2nd Workshop on High Performance Computing Reliability Issues (HPCRI), 2006.
21. HA-OSCAR at Louisiana Tech University, Ruston, LA, USA. http://xcr.cenit.latech.edu/ha-oscar.
22. I. Haddad, C. Leangsuksun, and S. L. Scott. HA-OSCAR: Towards highly available linux clusters. Linux World Magazine, March 2004.
23. X. He, L. Ou, S. L. Scott, and C. Engelmann. A highly available cluster storage system using scavenging. Proceedings of High Availability and Performance Computing Workshop (HAPCW), 2004.
24. High-End Computing Revitalization Task Force. HECRTF at http://www.nitrd.gov/subcommittee/hec/hecrtf-outreach.
25. InfiniBand. http://www.infinibandta.org/home.
26. Lawrence Berkeley National Laboratory, Berkeley, CA, USA. Berkeley Lab Checkpoint Restart (BLCR) Project at http://ftg.lbl.gov/checkpoint.
27. Lawrence Livermore National Laboratory, Livermore, CA, USA. Trace logs at http://www.llnl.gov/asci/platforms/white.
28. C. Leangsuksun, V. K. Munganuru, T. Liu, S. L. Scott, and C. Engelmann. Asymmetric active-active high availability for high-end computing. Proceedings of 2nd International Workshop on Operating Systems, Programming Environments and Management Tools for High-Performance Computing on Clusters (COSET-2), 2005.
30. L. Moser, Y. Amir, P. Melliar-Smith, and D. Agarwal. Extended virtual synchrony. Proceedings of 14th International Conference on Distributed Computing Systems (ICDCS), pages 56-65, 1994.
31. MPICH-V Project at University of Paris - South, France. http://www.lri.fr/~gk/mpich-v.
32. MPICH2. http://www-unix.mcs.anl.gov/mpi/mpich2.
33. MVAPICH2, MPI over InfiniBand Project. http://nowlab.cse.ohio-state.edu/projects/mpi-iba.
34. J. Nieplocha, V. Tipparaju, M. Krishnan, G. Santhanaraman, and D. Panda. Optimizing Mechanisms for Latency Tolerance in Remote Memory Access Communication on Clusters. IEEE Cluster Computing 2003, December 2003.
35. Oak Ridge National Laboratory, TN, USA. Harness project at http://www.csm.ornl.gov/harness.
36. Open MPI Project. http://www.open-mpi.org.
37. OpenPBS resource manager at Altair Engineering, Troy, MI, USA. http://www.openpbs.org.
38. PVFS at Clemson University, Clemson, SC, USA. http://www.parl.clemson.edu/pvfs.
39. PVM Project at Oak Ridge National Laboratory. Oak Ridge, TN, USA. http://www.csm.ornl.gov/pvm.
40. R. I. Resnick. A modern taxonomy of high availability, 1996. http://www.generalconcepts.com/resources/reliability/resnick/HA.htm.
41. Science Case for Large-scale Simulation. SCaLeS at http://www.pnl.gov/scales.
42. A. G. Shet and P. Sadayappan. Performance Instrumentation to Characterize Computation-Communication Overlap in Message-Passing Systems. Technical Report OSU-CISRC-2/06-TR25, The Ohio State University, February 2006.
43. SLURM resource manager at Lawrence Livermore National Laboratory, Livermore, CA, USA. http://www.llnl.gov/linux/slurm.
44. TORQUE resource manager at Cluster Resources, Inc., Spanish Fork, UT, USA. http://www.clusterresources.com.
45. K. Uhlemann. High availability for ultra-scale high-end scientific computing. Master Thesis at the Department of Computer Science of the University of Reading, UK, March 2006.
46. J. B. White and S. W. Bova. Where's the Overlap? An Analysis of Popular MPI Implementations. Third MPI Developers' and Users' Conference, March 1999.

