Runtime optimization of vector operations on large scale SMP clusters
Source
PACT: Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques
Toronto, Ontario, Canada
SESSION: I/O optimizations
Pages 122-132  
Year of Publication: 2008
ISBN: 978-1-60558-282-5
Authors
Costin Iancu  Lawrence Berkeley National Laboratory, Berkeley, CA, USA
Steven Hofmeyr  Lawrence Berkeley National Laboratory, Berkeley, CA, USA
Sponsors
ACM: Association for Computing Machinery
SIGARCH: ACM Special Interest Group on Computer Architecture
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1454115.1454134

ABSTRACT

"Vector" style communication operations transfer multiple disjoint memory regions within one logical step. These operations are widely used in applications and improve application performance, and their behavior has been studied and optimized using different implementation techniques across a large variety of systems. In this paper we present a methodology for selecting the best-performing implementation of a vector operation from multiple alternative implementations. Our approach is designed to work for systems with wide SMP nodes, where we believe that most published studies fail to correctly predict performance. Due to the emergence of multi-core processors, we believe that techniques similar to ours will be incorporated into communication libraries or language runtimes for performance reasons.
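
To make the idea of alternative implementations concrete, the following is a minimal sketch of one vector-style "put" realized in two ways. It is only an illustration: the function names, the `(offset, length)` region descriptor, and the choice of "one transfer per region" versus "pack, send once, unpack" are assumptions made for this example, not implementations named by the paper. Plain in-memory copies stand in for network transfers, and source and destination use the same offsets for simplicity.

```python
from typing import List, Tuple

# Hypothetical region descriptor: (offset, length) in bytes.
Region = Tuple[int, int]


def put_per_region(src: bytes, dst: bytearray, regions: List[Region]) -> int:
    """Copy every region with its own transfer; return the number of transfers."""
    transfers = 0
    for offset, length in regions:
        # One (simulated) network operation per disjoint region.
        dst[offset:offset + length] = src[offset:offset + length]
        transfers += 1
    return transfers


def put_packed(src: bytes, dst: bytearray, regions: List[Region]) -> int:
    """Pack all regions into one buffer, transfer once, unpack at the target."""
    # Pack: gather the disjoint regions into a contiguous staging buffer.
    staging = b"".join(src[offset:offset + length] for offset, length in regions)
    # A single (simulated) network operation moves the whole staging buffer.
    wire = bytes(staging)
    # Unpack: scatter the contiguous data back to the disjoint target regions.
    cursor = 0
    for offset, length in regions:
        dst[offset:offset + length] = wire[cursor:cursor + length]
        cursor += length
    return 1
```

Both functions leave the destination in the same state; they differ only in how many transfers are issued and how much copying is done on the host, which is the kind of trade-off a selection methodology has to weigh.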

The methodology relies on exploring the application space and classifying the regions within this space where a particular implementation method performs best. We use micro-benchmarks to measure the performance of an implementation at a given point in the application space and then compose profiles that compare the performance of two given implementations. These profiles capture an empirical upper bound on the performance degradation of a given protocol under heavy node load. At runtime, the application selects the implementation according to these performance profiles. Our approach provides performance portability: using our dynamic multi-protocol selection, we improved the performance of a NAS Parallel Benchmarks workload by 22% on a large-scale IBM cluster. Very positive results have also been obtained on large-scale InfiniBand and Cray XT systems. This work indicates that perhaps the most important factor for application performance on wide SMP systems is the successful management of load on the Network Interface Cards.
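
The profile-then-select workflow described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the application space is reduced to two hypothetical dimensions (message size and a node-load level), the two timing functions are made-up cost models standing in for real micro-benchmark measurements, and all names (`build_profile`, `select`, `time_a`, `time_b`) are invented for this example.

```python
from bisect import bisect_right
from typing import Callable, Dict, List, Tuple

Point = Tuple[int, int]  # hypothetical sample point: (message size, node load)


def time_a(size: int, load: int) -> float:
    """Made-up cost model: low startup cost, degrades as node load grows."""
    return 2.0 + 0.01 * size * load


def time_b(size: int, load: int) -> float:
    """Made-up cost model: higher startup cost, insensitive to node load."""
    return 10.0 + 0.01 * size


def build_profile(
    sizes: List[int],
    loads: List[int],
    measure_a: Callable[[int, int], float],
    measure_b: Callable[[int, int], float],
) -> Dict[Point, float]:
    """Offline step: store the ratio time(A) / time(B) at every sampled point."""
    return {
        (s, l): measure_a(s, l) / measure_b(s, l)
        for s in sizes
        for l in loads
    }


def select(
    profile: Dict[Point, float],
    sizes: List[int],
    loads: List[int],
    size: int,
    load: int,
) -> str:
    """Runtime step: look up the nearest sampled point at or below the query."""
    s = sizes[max(0, bisect_right(sizes, size) - 1)]
    l = loads[max(0, bisect_right(loads, load) - 1)]
    # A ratio below 1.0 means implementation A was faster at that point.
    return "A" if profile[(s, l)] < 1.0 else "B"


# Sample grid (sorted ascending); the values are arbitrary for illustration.
SIZES = [64, 1024, 16384]
LOADS = [1, 4, 16]
PROFILE = build_profile(SIZES, LOADS, time_a, time_b)
```

With these toy cost models, a call such as `select(PROFILE, SIZES, LOADS, 100, 1)` picks implementation A, while the same message size under a load of 16 picks B. A real profile would be built from measurements on the target system, and the runtime lookup is kept to a table access so that selection itself adds little overhead.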


