|
ABSTRACT
In prior work, we have proposed techniques to extend the ease of shared-memory parallel programming to distributed-memory platforms by automatic translation of OpenMP programs to MPI. In the case of irregular applications, the performance of this translation scheme is limited by the fact that accesses to shared-data cannot be accurately resolved at compile-time. Additionally, irregular applications with high communication to computation ratios pose challenges even for direct implementation on message passing systems. In this paper, we present combined compile-time/run-time techniques for optimizing irregular shared-memory applications on message passing systems in the context of automatic translation from OpenMP to MPI. Our transformations enable computation-communication overlap by restructuring irregular parallel loops. The compiler creates inspectors to analyze actual data access patterns for irregular accesses at runtime. This analysis is combined with the compile-time analysis of regular data accesses to determine which iterations of irregular loops access non-local data. The iterations are then reordered to enable computation-communication overlap. In the case where the irregular access occurs inside nested loops, the loop nest is restructured. We evaluate our techniques by translating OpenMP versions of three benchmarks from two important classes of irregular applications - sparse matrix computations and molecular dynamics. We find that for these applications, on sixteen nodes, versions employing computation-communication overlap are almost twice as fast as baseline OpenMP-to-MPI versions, almost 30% faster than inspector-only versions, almost 25% faster than hand-coded versions on two applications and about 9% slower on the third.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
NR Adiga , G Almasi , GS Almasi , Y Aridor , R Barik , D Beece , R Bellofatto , G Bhanot , R Bickford , M Blumrich , AA Bright , J Brunheroto , C Caşcaval , J Castaños , W Chan , L Ceze , P Coteus , S Chatterjee , D Chen , G Chiu , TM Cipolla , P Crumley , KM Desai , A Deutsch , T Domany , MB Dombrowa , W Donath , M Eleftheriou , C Erway , J Esch , B Fitch , J Gagliano , A Gara , R Garg , R Germain , ME Giampapa , B Gopalsamy , J Gunnels , M Gupta , F Gustavson , S Hall , RA Haring , D Heidel , P Heidelberger , LM Herger , D Hoenicke , RD Jackson , T Jamal-Eddine , GV Kopcsay , E Krevat , MP Kurhekar , AP Lanzetta , D Lieber , LK Liu , M Lu , M Mendell , A Misra , Y Moatti , L Mok , JE Moreira , BJ Nathanson , M Newton , M Ohmacht , A Oliner , V Pandit , RB Pudota , R Rand , R Regan , B Rubin , A Ruehli , S Rus , RK Sahoo , A Sanomiya , E Schenfeld , M Sharma , E Shmueli , S Singh , P Song , V Srinivasan , BD Steinmacher-Burow , K Strauss , C Surovic , R Swetz , T Takken , RB Tremaine , M Tsao , AR Umamaheshwaran , P Verma , P Vranas , TJC Ward , M Wazlowski , W Barrett , C Engel , B Drehmel , B Hilgart , D Hill , F Kasemkhani , D Krolak , CT Li , T Liebsch , J Marcella , A Muff , A Okomo , M Rouse , A Schram , M Tubbs , G Ulsh , C Wait , J Wittrup , M Bae , K Dockser , L Kissel , MK Seager , JS Vetter , K Yates, An overview of the BlueGene/L Supercomputer, Proceedings of the 2002 ACM/IEEE conference on Supercomputing, p.1-22, November 16, 2002, Baltimore, Maryland
|
| |
3
|
G. Almási, R. Bellofatto, J. Brunheroto, C. Caşcaval, J. G. C. nos, L. Ceze, P. Crumley, C. Erway, J. Gagliano, D. Lieber, X. Martorell, J. E. Moreira, A. Sanomiya, and K. Strauss. An overview of the BlueGene/L system software organization. In Proceedings of Euro-Par 2003 Conference, Lecture Notes in Computer Science, Klagenfurt, Austria, August 2003. Springer-Verlag.
|
| |
4
|
Vishal Aslot , Max J. Domeika , Rudolf Eigenmann , Greg Gaertner , Wesley B. Jones , Bodo Parady, SPEComp: A New Benchmark Suite for Measuring Parallel Computer Performance, Proceedings of the International Workshop on OpenMP Applications and Tools: OpenMP Shared Memory Parallel Programming, p.1-10, July 30-31, 2001
|
 |
5
|
|
| |
6
|
B. R. Brooks, R. E. Bruccoleri, B. D. Olafson, D. J. States, S. Swaminathan, and M. Karplus. Charmm: A program for macromolecular energy, minimization, and dynamics calculations. J. Comp. Chem., 4:187--217, 1983.
|
| |
7
|
G. Burns, R. Daoud, and J. Vaigl. LAM: An Open Cluster Environment for MPI. In Proceedings of Supercomputing Symposium, pages 379--386, 1994.
|
 |
8
|
Steve Carr , Kathryn S. McKinley , Chau-Wen Tseng, Compiler optimizations for improving data locality, Proceedings of the sixth international conference on Architectural support for programming languages and operating systems, p.252-262, October 05-07, 1994, San Jose, California, United States
|
| |
9
|
F. Darema, D. A. George, V. A. Norton, and G. F. Pfister. A single-program-multiple-data computational model for epex/fortran. Parallel Computing, 7(1):11--24, 1988.
|
| |
10
|
R. Das, D. Mavriplis, J. Saltz, S. Gupta, and R. Ponnusamy. The design and implementation of a parallel unstructured Euler solver using software primitives, AIAA-92-0562. In Proceedings of the 30th Aerospace Sciences Meeting, 1992.
|
| |
11
|
|
| |
12
|
|
 |
13
|
|
| |
14
|
|
| |
15
|
O. Forum. OpenMP: A Proposed Industry Standard API for Shared Memory Programming. Technical report, Oct. 1997.
|
| |
16
|
|
 |
17
|
Manish Gupta , Sam Midkiff , Edith Schonberg , Ven Seshadri , David Shields , Ko-Yang Wang , Wai-Mee Ching , Ton Ngo, An HPF compiler for the IBM SP2, Proceedings of the 1995 ACM/IEEE conference on Supercomputing (CDROM), p.71-es, December 04-08, 1995, San Diego, California, United States
[doi> 10.1145/224170.224422]
|
| |
18
|
|
| |
19
|
|
| |
20
|
High Performance Fortran Forum. High Performance Fortran language specification, version 1.0. Technical Report CRPC-TR92225, Houston, Tex., 1993.
|
| |
21
|
Paul N. Hilfinger , Dan Bonachea , David Gay , Susan Graham , Ben Liblit , Geoff Pike , Katherine Yelick, Titanium Language Reference Manual, University of California at Berkeley, Berkeley, CA, 2001
|
| |
22
|
D. Hisley, G. Agrawal, P. Satya-narayana, and L. Pollock. Porting and performance evaluation of irregular codes using OpenMP. Concurrency: Practice and Experience, 12(12):1241--1259, 2000.
|
| |
23
|
|
| |
24
|
|
| |
25
|
H. Jin, M. Frumkin, and J. Yan. The OpenMP Implementation of NAS Parallel Benchmarks and Its Performance. Technical Report NAS-99-011.
|
| |
26
|
|
 |
27
|
|
| |
28
|
S.-I. Lee, T. A. Johnson, and R. Eigenmann. Cetus - An Extensible Compiler Infrastructure for Source-to-Source Transformation. In Proc. of the Workshop on Languages and Compilers for Parallel Computing(LCPC'03), pages 539--553. (Springer-Verlag Lecture Notes in Computer Science), Oct. 2003.
|
 |
29
|
Honghui Lu , Alan L. Cox , Sandhya Dwarkadas , Ramakrishnan Rajamony , Willy Zwaenepoel, Compiler and software distributed shared memory support for irregular applications, Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming, p.48-56, June 18-21, 1997, Las Vegas, Nevada, United States
|
| |
30
|
|
| |
31
|
|
| |
32
|
|
| |
33
|
H. Simon. Partitioning of unstructured problems for parallel processing. Computing Systems in Engineering, 2(2-3):135--148, 1991.
|
| |
34
|
|
 |
35
|
|
 |
36
|
Akimasa Yoshida , Kenichi Koshizuka , Hironori Kasahara, Data-localization for Fortran macro-dataflow computation using partial static task assignment, Proceedings of the 10th international conference on Supercomputing, p.61-68, May 25-28, 1996, Philadelphia, Pennsylvania, United States
[doi> 10.1145/237578.237586]
|
|