Automatic generation and tuning of MPI collective communication routines
Source: International Conference on Supercomputing
Proceedings of the 19th Annual International Conference on Supercomputing
Cambridge, Massachusetts
Session 11: System-wide issues
Pages: 393 - 402  
Year of Publication: 2005
ISBN:1-59593-167-8
Authors
Ahmad Faraj  Florida State University, Tallahassee, FL
Xin Yuan  Florida State University, Tallahassee, FL
Sponsor
SIGARCH: ACM Special Interest Group on Computer Architecture
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 15,   Downloads (12 Months): 82,   Citation Count: 11
DOI: http://doi.acm.org/10.1145/1088149.1088202

ABSTRACT

To achieve high performance across platforms, collective communication routines must adapt to the system architecture and use different algorithms in different situations. Current Message Passing Interface (MPI) implementations, such as MPICH and LAM/MPI, do not fully adapt to the system architecture and fail to achieve high performance on many platforms. In this paper, we present a system that produces efficient MPI collective communication routines. By automatically generating topology-specific routines and empirically selecting the best implementations, our system adapts to a given platform and constructs routines customized for it. The experimental results show that the tuned routines consistently achieve high performance on clusters with different network topologies.
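
The empirical selection idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' system and makes no MPI calls: the names `measure`, `tune`, and the `timer` parameter are hypothetical, and the candidates are assumed to be ordinary callables that take a message size. The sketch only shows the general pattern of timing every candidate implementation for each message size and recording the fastest one.

```python
import time
from typing import Callable, Dict, List

# A candidate is any callable that runs one collective operation for a
# given message size. Real candidates would wrap MPI code; here they are
# assumed stand-ins.
Routine = Callable[[int], None]


def measure(routine: Routine, msg_size: int, repeats: int = 5) -> float:
    """Return the best wall-clock time (seconds) over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        routine(msg_size)
        best = min(best, time.perf_counter() - start)
    return best


def tune(
    candidates: Dict[str, Routine],
    msg_sizes: List[int],
    timer: Callable[[Routine, int], float] = measure,
) -> Dict[int, str]:
    """For each message size, record the name of the fastest candidate.

    `timer` is injectable so the selection logic can be checked with a
    deterministic cost model instead of real measurements.
    """
    table: Dict[int, str] = {}
    for size in msg_sizes:
        timings = {name: timer(fn, size) for name, fn in candidates.items()}
        table[size] = min(timings, key=timings.get)
    return table
```

A caller would pass a dictionary of candidate routines and a list of message sizes to `tune`, then consult the returned table at run time to choose a routine for a given size. Taking the minimum over repeated runs is one common way to reduce timing noise; the paper may measure differently.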

