|
ABSTRACT
We present a memory model to analyze and improve the performance of scientific algorithms on graphics processing units (GPUs). Our memory model is based on texturing hardware, which uses a 2D block-based array representation to perform the underlying computations. We incorporate many characteristics of GPU architectures including smaller cache sizes, 2D block representations, and use the 3C's model to analyze the cache misses. Moreover. we present techniques to improve the performance of nested loops on GPUs. In order to demonstrate the effectiveness of our model, we highlight its performance on three memory-intensive scientific applications - sorting, fast Fourier transform and dense matrix-multiplication. In practice, our cache-efficient algorithms for these applications are able to achieve memory throughput of 30-50 GB/s on a NVIDIA 7900 GTX GPU. We also compare our results with prior GPU-based and CPU-based implementations on high-end processors. In practice, we are able to achieve 2-5 x performance improvement.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
| |
2
|
E. Anderson , Z. Bai , C. Bischof , J. Demmel , J. Dongarra , J. Du Croz , A. Greenbaum , S. Hammarling , A. McKenney , S. Ostrouchov , D. Sorensen, LAPACK's user's guide, Society for Industrial and Applied Mathematics, Philadelphia, PA, 1992
|
| |
3
|
Arge, L., Brodal, G., and Fagerberg, R. 2004. Cache oblivious data structures. Handbook on Data Structures and Applications.
|
 |
4
|
|
| |
5
|
Banerjee, U. 1990. Unimodular transformations of double loops. Proc. of the Workshop on Advances in Lanugages and Compilers for Parallel Processing, 192--219.
|
| |
6
|
Batcher, K. 1968. Sorting networks and their applications. In AFIPS Spring Joint Computer Conference.
|
 |
7
|
|
 |
8
|
Ian Buck , Tim Foley , Daniel Horn , Jeremy Sugerman , Kayvon Fatahalian , Mike Houston , Pat Hanrahan, Brook for GPUs: stream computing on graphics hardware, ACM Transactions on Graphics (TOG), v.23 n.3, August 2004
|
| |
9
|
|
 |
10
|
|
| |
11
|
|
 |
12
|
|
| |
13
|
|
| |
14
|
|
| |
15
|
Göddeke, D. 2005. GPGPU performance tuning. Tech. rep., University of Dortmund, Germany. http://www.mathematik.uni-dortmund.de/~goeddeke/gpgpu/.
|
 |
16
|
Naga K. Govindaraju , Brandon Lloyd , Wei Wang , Ming Lin , Dinesh Manocha, Fast computation of database operations using graphics processors, Proceedings of the 2004 ACM SIGMOD international conference on Management of data, June 13-18, 2004, Paris, France
[doi> 10.1145/1007568.1007594]
|
 |
17
|
|
 |
18
|
Naga Govindaraju , Jim Gray , Ritesh Kumar , Dinesh Manocha, GPUTeraSort: high performance graphics co-processor sorting for large database management, Proceedings of the 2006 ACM SIGMOD international conference on Management of data, June 27-29, 2006, Chicago, IL, USA
[doi> 10.1145/1142473.1142511]
|
 |
19
|
|
| |
20
|
Hall, J. D., Carr, N., and Hart, J. 2003. Cache and bandwidth aware matrix multiplication on the GPU. Technical Report UIUCDCS-R-2003-2328, University of Illinois at Urbana-Champaign.
|
| |
21
|
|
| |
22
|
|
| |
23
|
|
 |
24
|
|
 |
25
|
Induprakas Kodukula , Nawaaz Ahmed , Keshav Pingali, Data-centric multi-level blocking, Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation, p.346-357, June 16-18, 1997, Las Vegas, Nevada, United States
|
 |
26
|
|
 |
27
|
Monica D. Lam , Edward E. Rothberg , Michael E. Wolf, The cache performance and optimizations of blocked algorithms, Proceedings of the fourth international conference on Architectural support for programming languages and operating systems, p.63-74, April 08-11, 1991, Santa Clara, California, United States
|
 |
28
|
|
| |
29
|
Lastra, A., Lin, M., and Manocha, D. 2004. ACM workshop on general purpose computation on graphics processors.
|
 |
30
|
|
 |
31
|
|
| |
32
|
Owens, J., Luebke, D., Govindaraju, N., Harris, M., Kruger, J., Lefohn, A., and Purcell, T. 2005. A survey of general-purpose computation on graphics hardware.
|
| |
33
|
Timothy J. Purcell , Craig Donner , Mike Cammarano , Henrik Wann Jensen , Pat Hanrahan, Photon mapping on programmable graphics hardware, Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, July 26-27, 2003, San Diego, California
|
| |
34
|
Rumpf, M., and Strzodka, R. 2001. Using graphics cards for quantized FEM computations. In Proc. of IASTED Visualization, Imaging and Image Processing Conference (VIIP'01), 193--202.
|
 |
35
|
|
| |
36
|
Tolimieri, R., An, M., and Lu, C. 1997. Algorithms for Discrete Fourier Transforms and Convolution. Springer.
|
 |
37
|
|
| |
38
|
|
| |
39
|
|
CITED BY 19
|
|
Sergio Romero , Maria A. Trenas , Eladio Gutierrez , Emilio L. Zapata, Locality-improved FFT implementation on a graphics processor, Proceedings of the 7th WSEAS International Conference on Signal Processing, Computational Geometry & Artificial Vision, p.58-63, August 24-26, 2007, Athens, Greece
|
|
|
Shane Ryoo , Christopher I. Rodrigues , Sam S. Stone , Sara S. Baghsorkhi , Sain-Zee Ueng , John A. Stratton , Wen-mei W. Hwu, Program optimization space pruning for a multithreaded gpu, Proceedings of the sixth annual IEEE/ACM international symposium on Code generation and optimization, April 05-09, 2008, Boston, MA, USA
|
|
|
Muthu Manikandan Baskaran , Uday Bondhugula , Sriram Krishnamoorthy , J. Ramanujam , Atanas Rountev , P. Sadayappan, A compiler framework for optimization of affine loop nests for gpgpus, Proceedings of the 22nd annual international conference on Supercomputing, June 07-12, 2008, Island of Kos, Greece
|
|
|
|
|
|
|
|
|
Perry H. Wang , Jamison D. Collins , Gautham N. Chinya , Hong Jiang , Xinmin Tian , Milind Girkar , Nick Y. Yang , Guei-Yuan Lueh , Hong Wang, EXOCHI: architecture and programming environment for a heterogeneous multi-core multithreaded system, ACM SIGPLAN Notices, v.42 n.6, June 2007
|
|
|
|
|
|
|
|
|
Shane Ryoo , Christopher I. Rodrigues , Sara S. Baghsorkhi , Sam S. Stone , David B. Kirk , Wen-mei W. Hwu, Optimization principles and application performance evaluation of a multithreaded GPU using CUDA, Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming, February 20-23, 2008, Salt Lake City, UT, USA
|
|
|
|
|
|
|
|
|
|
|
|
Mark Silberstein , Assaf Schuster , Dan Geiger , Anjul Patney , John D. Owens, Efficient computation of sum-products on GPUs through software-managed cache, Proceedings of the 22nd annual international conference on Supercomputing, June 07-12, 2008, Island of Kos, Greece
|
|
|
Shane Ryoo , Christopher I. Rodrigues , Sam S. Stone , John A. Stratton , Sain-Zee Ueng , Sara S. Baghsorkhi , Wen-mei W. Hwu, Program optimization carving for GPU computing, Journal of Parallel and Distributed Computing, v.68 n.10, p.1389-1401, October, 2008
|
|
|
|
|
|
|
|
|
Naga K. Govindaraju , Brandon Lloyd , Yuri Dotsenko , Burton Smith , John Manferdelli, High performance discrete Fourier transforms on graphics processors, Proceedings of the 2008 ACM/IEEE conference on Supercomputing, November 15-21, 2008, Austin, Texas
|
|
|
|
|
|
|
|