| Chunking parallel loops in the presence of synchronization |
| Full text |
Pdf
(571 KB)
|
Source
|
International Conference on Supercomputing
archive
Proceedings of the 23rd international conference on Supercomputing
table of contents
Yorktown Heights, NY, USA
SESSION: Compilers
table of contents
Pages 181-192
Year of Publication: 2009
ISBN:978-1-60558-498-0
|
|
Authors
|
|
Jun Shirako
|
Rice University, Houston, TX, USA
|
|
Jisheng M. Zhao
|
Rice University, Houston, TX, USA
|
|
V. Krishna Nandivada
|
IBM India Research Laboratory, Bangalore, India
|
|
Vivek N. Sarkar
|
Rice University, Houston, TX, USA
|
|
| Sponsors |
|
| Publisher |
|
| Bibliometrics |
Downloads (6 Weeks): 28, Downloads (12 Months): 79, Citation Count: 0
|
|
|
ABSTRACT
Modern languages for shared-memory parallelism are moving from a bulk-synchronous Single Program Multiple Data (SPMD) execution model to lightweight Task Parallel execution models for improved productivity. This shift is intended to encourage programmers to express the ideal parallelism in an application at a fine granularity that is natural for the underlying domain, while delegating to the compiler and runtime system the job of extracting coarser-grained useful parallelism for a given target system. A simple and important example of this separation of concerns between ideal and useful parallelism can be found in chunking of parallel loops, where the programmer expresses ideal parallelism by declaring all iterations of a loop to be parallel and the implementation exploits useful parallelism by executing iterations of the loop in sequential chunks. Though chunking of parallel loops has been used as a standard transformation for several years, it poses some interesting challenges when the parallel loop may directly or indirectly (via procedure calls) perform synchronization operations such as barrier, signal or wait statements. In such cases, a straightforward transformation that attempts to execute a chunk of loops in sequence in a single thread may violate the semantics of the original parallel program. In this paper, we address the problem of chunking parallel loops that may contain synchronization operations. We present a transformation framework that uses a combination of transformations from past work (e.g., loop strip-mining, interchange, distribution, unswitching) to obtain an equivalent set of parallel loops that chunk together statements from multiple iterations while preserving the semantics of the original parallel program. These transformations result in reduced synchronization and scheduling overheads, thereby improving performance and scalability. Our experimental results for 11 benchmark programs on an UltraSPARC II multicore processor showed a geometric mean speedup of 0.52x for the unchunked case and 9.59x for automatic chunking using the techniques described in this paper. This wide gap underscores the importance of using these techniques in future compiler and runtime systems for programming models with lightweight parallelism.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
E. Allan et al. The Fortress language specification version 0.618. Technical report, Sun Microsystems, April 2005.
|
| |
2
|
R. Barik et al. Experiences with an SMP implementation for X10 based on the Java concurrency utilities. In Proceedings of the Workshop on Programming Models for Ubiquitous Parallelism, Seattle, Washington, 2006.
|
 |
3
|
Ganesh Bikshandi , Jose G. Castanos , Sreedhar B. Kodali , V. Krishna Nandivada , Igor Peshansky , Vijay A. Saraswat , Sayantan Sur , Pradeep Varma , Tong Wen, Efficient, portable implementation of asynchronous multi-place programs, Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming, February 14-18, 2009, Raleigh, NC, USA
|
 |
4
|
Robert D. Blumofe , Christopher F. Joerg , Bradley C. Kuszmaul , Charles E. Leiserson , Keith H. Randall , Yuli Zhou, Cilk: an efficient multithreaded runtime system, Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming, p.207-216, July 19-21, 1995, Santa Barbara, California, United States
|
 |
5
|
Philippe Charles , Christian Grothoff , Vijay Saraswat , Christopher Donawa , Allan Kielstra , Kemal Ebcioglu , Christoph von Praun , Vivek Sarkar, X10: an object-oriented approach to non-uniform cluster computing, Proceedings of the 20th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications, October 16-20, 2005, San Diego, CA, USA
|
| |
6
|
F. Darema et al. A Single-Program-Multiple-Data Computational model for EPEX/FORTRAN. Parallel Computing, 7(1):11--24, 1988.
|
| |
7
|
S. Deitz. Parallel programming in chapel. http://www.cct.lsu.edu/ estrabd/LACSI2006/Programming 2006.
|
 |
8
|
|
 |
9
|
|
| |
10
|
B. Goetz. Java Concurrency In Practice. Addison-Wesley, 2007.
|
 |
11
|
|
| |
12
|
|
| |
13
|
Cray Inc. The Chapel language specification version 0.4. Technical report, Cray Inc., February 2005.
|
| |
14
|
|
| |
15
|
|
| |
16
|
|
| |
17
|
OpenMP Application Program Interface, version 3.0, May 2008. http://www.openmp.org/mp-documents/spec30.pdf.
|
| |
18
|
|
 |
19
|
|
 |
20
|
|
 |
21
|
|
| |
22
|
R. Vallée-Rai et al. Soot-a Java Optimization Framework. In Proceedings of CASCON 1999, pages 125--135, 1999.
|
| |
23
|
|
| |
24
|
X10 release on SourceForge. http://x10.sf.net.
|
 |
25
|
Katherine Yelick , Dan Bonachea , Wei-Yu Chen , Phillip Colella , Kaushik Datta , Jason Duell , Susan L. Graham , Paul Hargrove , Paul Hilfinger , Parry Husbands , Costin Iancu , Amir Kamil , Rajesh Nishtala , Jimmy Su , Michael Welcome , Tong Wen, Productivity and performance using partitioned global address space languages, Proceedings of the 2007 international workshop on Parallel symbolic computation, July 27-28, 2007, London, Ontario, Canada
[doi> 10.1145/1278177.1278183]
|
|