ABSTRACT
"Vector" style communication operations transfer multiple disjoint memory regions within one logical step. These operations are widely used in applications and improve application performance, and their behavior has been studied and optimized using different implementation techniques across a large variety of systems. In this paper we present a methodology for selecting the best performing implementation of a vector operation from multiple alternatives. Our approach is designed for systems with wide SMP nodes, where we believe that most published studies fail to predict performance correctly. With the emergence of multi-core processors, we believe that techniques similar to ours will be incorporated into communication libraries or language runtimes for performance reasons. The methodology relies on exploring the application space and classifying the regions within this space where a particular implementation method performs best. We use micro-benchmarks to measure the performance of an implementation at a given point in the application space and then compose profiles that compare the performance of two given implementations. These profiles capture an empirical upper bound for the performance degradation of a given protocol under heavy node load. At runtime, the application selects the implementation according to these performance profiles. Our approach provides performance portability: using our dynamic multi-protocol selection, we have been able to improve the performance of a NAS Parallel Benchmarks workload by 22% on a large scale IBM cluster. Very positive results have also been obtained on large scale InfiniBand and Cray XT systems. This work indicates that perhaps the most important factor for application performance on wide SMP systems is the successful management of load on the Network Interface Cards.
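The profile-then-select idea can be illustrated with a minimal sketch. It is not the paper's implementation. The implementation names (`pack`, `multi_put`), the two-dimensional application space (number of regions, bytes per region), the nearest-point lookup, and all timings are assumptions made up for illustration, and node load is not modeled.

```python
import math

def build_profile(timings):
    """Map each sampled point to the name of its fastest implementation."""
    return {point: min(times, key=times.get) for point, times in timings.items()}

def select_implementation(profile, n_regions, region_bytes):
    """Return the implementation recorded at the nearest sampled point.

    Distance is measured in log2 space, since the sampled region counts
    and sizes are assumed to span several orders of magnitude.
    """
    def distance(point):
        regions, size = point
        return (abs(math.log2(regions) - math.log2(n_regions))
                + abs(math.log2(size) - math.log2(region_bytes)))
    return profile[min(profile, key=distance)]

# Hypothetical micro-benchmark results in microseconds, keyed by
# (number of regions, bytes per region). The values are invented.
timings = {
    (4, 64): {"pack": 10.0, "multi_put": 14.0},
    (4, 65536): {"pack": 220.0, "multi_put": 180.0},
    (256, 64): {"pack": 40.0, "multi_put": 300.0},
    (256, 65536): {"pack": 9000.0, "multi_put": 7000.0},
}

profile = build_profile(timings)
choice = select_implementation(profile, n_regions=8, region_bytes=128)
```

In this sketch the profile is computed once, offline, and the runtime cost of selection is a lookup over the sampled points. A profile that also reflects degradation under heavy node load would need a load dimension in the sampled space.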