| Compiler-assisted dynamic scheduling for effective parallelization of loop nests on multicore processors |
| Full text |
Pdf
(545 KB)
|
Source
|
Principles and Practice of Parallel Programming
archive
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
table of contents
Raleigh, NC, USA
SESSION: Parallel compilers and tools
table of contents
Pages 219-228
Year of Publication: 2009
ISBN:978-1-60558-397-6
Also published in ...
|
|
Authors
|
|
Muthu Manikandan Baskaran
|
The Ohio state University, Columbus, OH, USA
|
|
Nagavijayalakshmi Vydyanathan
|
The Ohio state University, Columbus, OH, USA
|
|
Uday Kumar Reddy Bondhugula
|
The Ohio state University, Columbus, OH, USA
|
|
J. Ramanujam
|
Louisiana State University, Baton Rouge, LA, USA
|
|
Atanas Rountev
|
The Ohio state University, Columbus, OH, USA
|
|
P. Sadayappan
|
The Ohio state University, Columbus, OH, USA
|
|
| Sponsors |
|
| Publisher |
|
| Bibliometrics |
Downloads (6 Weeks): 37, Downloads (12 Months): 263, Citation Count: 0
|
|
|
ABSTRACT
Recent advances in polyhedral compilation technology have made it feasible to automatically transform affine sequential loop nests for tiled parallel execution on multi-core processors. However, for multi-statement input programs with statements of different dimensionalities, such as Cholesky or LU decomposition, the parallel tiled code generated by existing automatic parallelization approaches may suffer from significant load imbalance, resulting in poor scalability on multi-core systems. In this paper, we develop a completely automatic parallelization approach for transforming input affine sequential codes into efficient parallel codes that can be executed on a multi-core system in a load-balanced manner. In our approach, we employ a compile-time technique that enables dynamic extraction of inter-tile dependences at run-time, and dynamic scheduling of the parallel tiles on the processor cores for improved scalable execution. Our approach obviates the need for programmer intervention and re-writing of existing algorithms for efficient parallel execution on multi-cores. We demonstrate the usefulness of our approach through comparisons using linear algebra computations: LU and Cholesky decomposition.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
 |
2
|
|
 |
3
|
|
| |
4
|
|
| |
5
|
C. Bastoul, A. Cohen, S. Girbal, S. Sharma, and O. Temam. Putting polyhedral loop transformations to work. In Workshop on Languages and Compilers for Parallel Computing (LCPC'03), pages 23--30, 2003.
|
| |
6
|
U. Bondhugula, M. Baskaran, S. Krishnamoorthy, J. Ramanujam, A. Rountev, and P. Sadayappan. Affine transformations for communication minimal parallelization and locality optimization of arbitrarily nested loop sequences. Technical Report OSU-CISRC-5/07-TR43, Ohio State University, May 2007.
|
| |
7
|
U. Bondhugula, M. Baskaran, S. Krishnamoorthy, J. Ramanujam, A. Rountev, and P. Sadayappan. Automatic transformations for communication-minimized parallelization and locality optimization in the polyhedral model. In International Conference on Compiler Construction (ETAPS CC), Apr. 2008.
|
 |
8
|
Uday Bondhugula , Albert Hartono , J. Ramanujam , P. Sadayappan, A practical automatic polyhedral parallelizer and locality optimizer, Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation, June 07-13, 2008, Tucson, AZ, USA
|
| |
9
|
U. Bondhugula, J. Ramanujam, and P. Sadayappan. Pluto: A practical and fully automatic polyhedral parallelizer and locality optimizer. Technical Report OSU-CISRC-10/07-TR70, The Ohio State University, Oct. 2007.
|
| |
10
|
|
| |
11
|
A. Buttari, J. Dongarra, P. Husbands, J. Kurzak, and K. Yelick. Multithreading for synchronization tolerance in matrix factorization. In Proceedings of the SciDAC 2007 Conference. Journal of Physics: Conference Series, 2007.
|
| |
12
|
A. Buttari, J. Langou, J. Kurzak, and J. Dongarra. A class of parallel tiled linear algebra algorithms for multicore architectures. Technical Report UT-CS-07-600, Innovative Computing Laboratory, University of Tennessee Knoxville, September 2007. Submitted to Parallel Computing. LAPACK Working Note 191.
|
| |
13
|
|
 |
14
|
|
| |
15
|
CLooG: The Chunky Loop Generator. http://www.cloog.org.
|
| |
16
|
A. Darte, G.-A. Silber, and F. Vivien. Combining retiming and scheduling techniques for loop parallelization and loop tiling. Parallel Processing Letters, 7(4):379--392, 1997.
|
| |
17
|
|
| |
18
|
J. Dongarra. Four important concepts that will effect math software. In 9th International Workshop on State-of-the-Art in Scientific and Parallel Computing (PARA'08), 2008.
|
| |
19
|
P. Feautrier. Dataflow analysis of array and scalar references. IJPP, 20(1):23--53, 1991.
|
| |
20
|
|
| |
21
|
|
| |
22
|
|
| |
23
|
|
| |
24
|
Sylvain Girbal , Nicolas Vasilache , Cédric Bastoul , Albert Cohen , David Parello , Marc Sigler , Olivier Temam, Semi-automatic composition of loop transformations for deep parallelism and memory hierarchies, International Journal of Parallel Programming, v.34 n.3, p.261-317, June 2006
[doi> 10.1007/s10766-006-0012-3]
|
| |
25
|
M. Griebl. Automatic Parallelization of Loop Programs for Distributed Memory Architectures. FMI, University of Passau, 2004. Habilitation Thesis.
|
| |
26
|
|
 |
27
|
|
 |
28
|
|
| |
29
|
|
 |
30
|
|
 |
31
|
|
| |
32
|
|
| |
33
|
Parallel linear algebra for scalable multi-core architectures (PLASMA) project. http://icl.cs.utk.edu/plasma.
|
| |
34
|
PLUTO: A polyhedral automatic parallelizer and locality optimizer for multicores. http://pluto-compiler.sourceforge.net.
|
 |
35
|
|
 |
36
|
|
| |
37
|
|
 |
38
|
Carlos García Quiñones , Carlos Madriles , Jesús Sánchez , Pedro Marcuello , Antonio González , Dean M. Tullsen, Mitosis compiler: an infrastructure for speculative threading based on pre-computation slices, Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation, June 12-15, 2005, Chicago, IL, USA
|
 |
39
|
|
| |
40
|
P. Rundberg and P. S. Om. Low-cost thread-level data dependence speculation on multiprocessors. In In Fourth Workshop on Multithreaded Execution, Architecture and Compilation, pages 1--9, 2000.
|
 |
41
|
|
| |
42
|
|
| |
43
|
|
| |
44
|
|
 |
45
|
|
| |
46
|
N. Vasilache, C. Bastoul, and A. Cohen. Polyhedral code generation in the real world. In International Conference on Compiler Construction (ETAPS CC'06), pages 185--201, Mar. 2006.
|
 |
47
|
Nicolas Vasilache , Cedric Bastoul , Albert Cohen , Sylvain Girbal, Violated dependence analysis, Proceedings of the 20th annual international conference on Supercomputing, June 28-July 01, 2006, Cairns, Queensland, Australia
[doi> 10.1145/1183401.1183448]
|
| |
48
|
|
|