Automatic nonblocking communication for partitioned global address space programs
Source
International Conference on Supercomputing
Proceedings of the 21st annual international conference on Supercomputing
Seattle, Washington
SESSION: Message passing systems
Pages: 158-167
Year of Publication: 2007
ISBN: 978-1-59593-768-1
Authors
Wei-Yu Chen  University of California at Berkeley and Lawrence Berkeley National Laboratory
Dan Bonachea  University of California at Berkeley and Lawrence Berkeley National Laboratory
Costin Iancu  Lawrence Berkeley National Laboratory
Katherine Yelick  University of California at Berkeley and Lawrence Berkeley National Laboratory
Sponsor
SIGARCH: ACM Special Interest Group on Computer Architecture
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 13,   Downloads (12 Months): 61,   Citation Count: 3
DOI: http://doi.acm.org/10.1145/1274971.1274995

ABSTRACT

Overlapping communication with computation is an important optimization on current cluster architectures; its importance is likely to increase as the doubling of processing power far outpaces any improvements in communication latency. PGAS languages offer unique opportunities for communication overlap, because their one-sided communication model enables low-overhead data transfer. Recent results have shown the value of hiding latency by manually applying language-level nonblocking data transfer routines, but this process can be both tedious and error-prone. In this paper, we present a runtime framework that automatically schedules the data transfers to achieve overlap. The optimization framework is entirely transparent to the user, and aggressively reorders and aggregates both remote puts and gets. We preserve correctness via runtime conflict checks and temporary buffers, using several techniques to lower the overhead. Experimental results on application benchmarks suggest that our framework can be very effective at hiding communication latency on clusters, improving performance over the blocking code by an average of 16% for some of the NAS Parallel Benchmarks, 48% for GUPS, and over 25% for a multi-block fluid dynamics solver. While the system is not yet as effective as aggressive manual optimization, it increases programmers' productivity by freeing them from the details of communication management.
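
The general idea of deferring remote writes, buffering them, and checking for conflicts at run time can be illustrated with a small toy model. The sketch below is not the paper's framework: the class name `ToyRuntime`, its methods, and the dictionary standing in for remote memory are all hypothetical, and only puts are deferred and aggregated (gets stay blocking) to keep the example short.

```python
class ToyRuntime:
    """Hypothetical toy model, not the paper's implementation.

    Remote memory is simulated by a dict; puts are deferred until sync().
    """

    def __init__(self, remote):
        self.remote = remote        # simulated remote memory: address -> value
        self.pending_puts = {}      # temporary buffer: address -> buffered value
        self.messages_sent = 0      # number of simulated network messages

    def put(self, addr, value):
        # Buffer the value instead of sending it immediately.
        # A later put to the same address simply replaces the buffered value.
        self.pending_puts[addr] = value

    def get(self, addr):
        # Runtime conflict check: a read of an address with a pending put
        # must observe the buffered value, not the stale remote one.
        if addr in self.pending_puts:
            return self.pending_puts[addr]
        self.messages_sent += 1     # blocking read, one simulated message
        return self.remote[addr]

    def sync(self):
        # Aggregate all buffered puts into a single simulated message.
        if self.pending_puts:
            self.remote.update(self.pending_puts)
            self.pending_puts.clear()
            self.messages_sent += 1


remote = {0: 0, 1: 0, 2: 0}
rt = ToyRuntime(remote)
rt.put(0, 10)
rt.put(1, 11)
rt.put(0, 12)                  # overwrites the buffered put to address 0
value_seen = rt.get(0)         # served from the buffer by the conflict check
rt.sync()                      # three puts leave as one simulated message
```

In this toy model the program order of reads and writes is preserved from the caller's point of view even though no data moves until `sync()`. A real runtime would also have to handle overlapping address ranges, nonblocking gets, and synchronization points defined by the language.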



