ABSTRACT
Iterative stencil loops (ISLs) are used in many applications, and tiling is a well-known technique for localizing their computation. When ISLs are tiled across a parallel architecture, there are usually halo regions that need to be updated and exchanged among different processing elements (PEs). In addition, synchronization is often used to signal the completion of halo exchanges. Both communication and synchronization may incur significant overhead on parallel architectures with shared memory. This is especially true in the case of graphics processors (GPUs), which do not preserve the state of the per-core L1 storage across global synchronizations. To reduce these overheads, ghost zones can be created to replicate stencil operations, reducing communication and synchronization costs at the expense of redundantly computing some values on multiple PEs. However, the optimal ghost zone size depends on the characteristics of both the architecture and the application, and its selection has only been studied for message-passing systems in a grid environment. To automate this selection on shared memory systems, we establish a performance model using NVIDIA's Tesla architecture as a case study and propose a framework that uses the performance model to automatically select the best-performing ghost zone size and generate appropriate code. The model is validated on four diverse ISL applications, for which the predicted ghost zone configurations achieve at least 98% of the optimal speedup.
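The ghost zone trade-off described above can be illustrated with a minimal Python sketch. It is not the paper's framework or performance model: the 1D three-point averaging stencil, the fixed boundary cells, and the names `step`, `reference`, and `tiled` are all assumptions made for illustration. Each tile is loaded with `ghost` extra cells per side, which lets it run `ghost` sweeps before the next exchange, at the cost of recomputing the overlapping cells.

```python
def step(u):
    """One sweep of a 3-point averaging stencil; the two end cells are held fixed.

    Assumes len(u) >= 2.
    """
    interior = [(u[i - 1] + u[i] + u[i + 1]) / 3.0 for i in range(1, len(u) - 1)]
    return [u[0]] + interior + [u[-1]]


def reference(u, iters):
    """Untiled baseline: sweep the whole array `iters` times."""
    for _ in range(iters):
        u = step(u)
    return u


def tiled(u, iters, tile, ghost):
    """Tiled sweeps with a ghost zone of `ghost` cells on each side of a tile.

    Returns the final array and the number of exchange rounds, i.e. how
    often tiles had to reload neighbor data (a stand-in for global
    synchronization in this sketch).
    """
    n = len(u)
    done = 0
    rounds = 0
    while done < iters:
        k = min(ghost, iters - done)      # sweeps possible before halo data goes stale
        out = list(u)
        for start in range(0, n, tile):
            end = min(start + tile, n)
            lo = max(start - k, 0)        # extend the tile by k ghost cells per side
            hi = min(end + k, n)
            local = u[lo:hi]
            for _ in range(k):
                local = step(local)       # ghost cells are computed redundantly
            # Only the tile's own cells are still valid after k sweeps.
            out[start:end] = local[start - lo:end - lo]
        u = out
        done += k
        rounds += 1
    return u, rounds


if __name__ == "__main__":
    u0 = [float(i % 5) for i in range(32)]
    ref = reference(u0, 7)
    for g in (1, 3):
        out, rounds = tiled(u0, 7, tile=8, ghost=g)
        ok = all(abs(a - b) < 1e-12 for a, b in zip(out, ref))
        print("ghost =", g, "rounds =", rounds, "matches reference:", ok)
```

A wider ghost zone lowers the number of exchange rounds but enlarges the region each tile recomputes, which is why the best width depends on how expensive synchronization is relative to computation on a given architecture.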