ABSTRACT
Overlapping communication with computation is an important optimization on current cluster architectures; its importance is likely to increase as the doubling of processing power far outpaces any improvements in communication latency. PGAS languages offer unique opportunities for communication overlap, because their one-sided communication model enables low-overhead data transfer. Recent results have shown the value of hiding latency by manually applying language-level nonblocking data transfer routines, but this process can be both tedious and error-prone. In this paper, we present a runtime framework that automatically schedules the data transfers to achieve overlap. The optimization framework is entirely transparent to the user, and aggressively reorders and aggregates both remote puts and gets. We preserve correctness via runtime conflict checks and temporary buffers, using several techniques to lower the overhead. Experimental results on application benchmarks suggest that our framework can be very effective at hiding communication latency on clusters, improving performance over the blocking code by an average of 16% for some of the NAS Parallel Benchmarks, 48% for GUPS, and over 25% for a multi-block fluid dynamics solver. While the system is not yet as effective as aggressive manual optimization, it increases programmers' productivity by freeing them from the details of communication management.