ABSTRACT
Optimizing program execution for Graphics Processing Units (GPUs) can be very challenging. Efficiently mapping serial code to a GPU or stream processing platform is time consuming and is greatly hampered by a lack of detail about the underlying hardware, so programmers are left to produce optimized code by trial and error. The recent publication of the underlying instruction set architecture (ISA) of the AMD/ATI GPU has allowed researchers to begin proposing aggressive optimizations. In this work, we present an optimization methodology that uses this information to accelerate programs on AMD/ATI GPUs. We first define the optimization spaces that guide our work. We then start from disassembled machine code and collect program statistics provided by the AMD Graphics Shader Analyzer (GSA) profiling toolset. We explore optimizations targeting three computing resources: 1) ALUs, 2) fetch bandwidth, and 3) thread usage, and present techniques for making better use of each. We demonstrate the effectiveness of the proposed approach on an AMD Radeon HD3870 GPU using the Brook+ stream programming language. We describe our optimizations using two commonly used GPGPU applications that present very different program characteristics and optimization spaces: matrix multiplication and back-projection for medical image reconstruction. Our results show that optimized code improves performance by 1.45x to 6.7x compared with unoptimized code run on the same GPU platform. Our optimized implementations are 882x (matrix multiplication) and 19x (back-projection) faster than serial implementations run on a 2.66 GHz Intel Core 2 Duo with 2 GB of main memory.