Program optimization space pruning for a multithreaded GPU
Source: Proceedings of the sixth annual IEEE/ACM international symposium on Code generation and optimization, Boston, MA, USA
Session: Compiling for multicore and multithreading
Pages: 195-204
Year of Publication: 2008
ISBN: 978-1-59593-978-4
Authors
Shane Ryoo, University of Illinois at Urbana-Champaign, Urbana, IL, USA
Christopher I. Rodrigues, University of Illinois at Urbana-Champaign, Urbana, IL, USA
Sam S. Stone, University of Illinois at Urbana-Champaign, Urbana, IL, USA
Sara S. Baghsorkhi, University of Illinois at Urbana-Champaign, Urbana, IL, USA
Sain-Zee Ueng, University of Illinois at Urbana-Champaign, Urbana, IL, USA
John A. Stratton, University of Illinois at Urbana-Champaign, Urbana, IL, USA
Wen-mei W. Hwu, University of Illinois at Urbana-Champaign, Urbana, IL, USA
ABSTRACT
Program optimization for highly parallel systems has historically been considered an art, with experts doing much of the performance tuning by hand. With the introduction of inexpensive, single-chip, massively parallel platforms, more developers who lack the substantial experience and knowledge needed to maximize performance will be creating highly parallel applications for these platforms. This creates a need for more structured optimization methods with means to estimate their performance effects. Furthermore, these methods need to be understandable by most programmers. This paper shows the complexity involved in optimizing applications for one such system and presents one relatively simple methodology for reducing the workload involved in the optimization process. The work is based on the GeForce 8800 GTX using CUDA. Its flexible allocation of resources to threads allows it to extract performance from a range of applications with varying resource requirements, but places new demands on developers who seek to maximize an application's performance. We show how optimizations interact with the architecture in complex ways, initially prompting an inspection of the entire configuration space to find the optimal configuration. Even for a seemingly simple application such as matrix multiplication, the optimal configuration can be unexpected. We then present metrics derived from static code that capture the first-order factors of performance. We demonstrate how these metrics can be used to prune many optimization configurations, down to those that lie on a Pareto-optimal curve. This reduces the optimization space by as much as 98% and still finds the optimal configuration for each of the studied applications.
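The pruning step described in the abstract can be illustrated with a minimal sketch. The abstract does not name the metrics or give their formulas, so the two metrics below (`metric_a`, `metric_b`), the configuration labels, and all numeric values are hypothetical placeholders; the sketch only assumes that each configuration has two static scores where higher is better, and keeps the configurations that no other configuration dominates.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    name: str        # hypothetical label for one optimization configuration
    metric_a: float  # first static metric, higher is better (assumed)
    metric_b: float  # second static metric, higher is better (assumed)


def dominates(p: Config, q: Config) -> bool:
    """True if p is at least as good as q on both metrics and strictly better on one."""
    return (
        p.metric_a >= q.metric_a
        and p.metric_b >= q.metric_b
        and (p.metric_a > q.metric_a or p.metric_b > q.metric_b)
    )


def pareto_prune(configs: List[Config]) -> List[Config]:
    """Keep only the configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


# Made-up example values, not measurements from the paper.
configs = [
    Config("tile8_unroll1", 1.0, 4.0),
    Config("tile16_unroll1", 2.0, 3.0),
    Config("tile16_unroll4", 3.0, 1.0),
    Config("tile8_unroll2", 1.5, 2.5),  # dominated by tile16_unroll1
    Config("tile4_unroll1", 0.5, 0.5),  # dominated by every other entry
]

survivors = pareto_prune(configs)
print([c.name for c in survivors])
```

Only the surviving configurations would then need to be compiled and timed on the hardware. The quadratic scan is adequate for configuration spaces of a few hundred or thousand points; a sort-based sweep would be a reasonable alternative for larger ones.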