ABSTRACT
To achieve high performance across platforms, collective communication routines must adapt to the system architecture and use different algorithms in different situations. Current Message Passing Interface (MPI) implementations, such as MPICH and LAM/MPI, do not fully adapt to the system architecture and fail to achieve high performance on many platforms. In this paper, we present a system that produces efficient MPI collective communication routines. By automatically generating topology-specific routines and empirically selecting the best implementations, our system adapts to a given platform and constructs routines customized for it. Experimental results show that the tuned routines consistently achieve high performance on clusters with different network topologies.
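The empirical selection step described in the abstract can be illustrated with a minimal sketch. It is not the paper's system: it assumes each candidate collective routine is available as a plain Python callable, and the names `measure`, `tune`, `candidates`, and `measure_fn` are hypothetical, introduced only for illustration. The idea shown is simply to time every candidate for each message size and record the fastest one.

```python
import time
from typing import Callable, Dict, Sequence


def measure(routine: Callable[[bytes], object], message: bytes,
            repeats: int = 5) -> float:
    """Return the minimum wall-clock time of `routine` over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        routine(message)
        best = min(best, time.perf_counter() - start)
    return best


def tune(candidates: Dict[str, Callable[[bytes], object]],
         message_sizes: Sequence[int],
         measure_fn: Callable[[Callable[[bytes], object], bytes], float] = measure
         ) -> Dict[int, str]:
    """Map each message size to the name of the fastest candidate.

    `candidates` is a hypothetical registry of routine name -> callable;
    a real tuner would run actual MPI collectives on the target cluster.
    """
    table: Dict[int, str] = {}
    for size in message_sizes:
        message = bytes(size)  # zero-filled payload of the given size
        timings = {name: measure_fn(fn, message)
                   for name, fn in candidates.items()}
        table[size] = min(timings, key=timings.get)
    return table
```

Passing the timing function as a parameter keeps the selection logic separate from the measurement method, so a different timer (or a cost model) can be substituted without changing `tune`.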