A memory model for scientific algorithms on graphics processors
Source: Conference on High Performance Networking and Computing
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Tampa, Florida
SESSION: Technical papers
Article No. 89  
Year of Publication: 2006
ISBN:0-7695-2700-0
Authors
Naga K. Govindaraju  UNC Chapel Hill and Microsoft Corporation
Scott Larsen  UNC Chapel Hill
Jim Gray  Microsoft Corporation
Dinesh Manocha  UNC Chapel Hill
Sponsors
IEEE : Institute of Electrical and Electronics Engineers
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1188455.1188549

ABSTRACT

We present a memory model to analyze and improve the performance of scientific algorithms on graphics processing units (GPUs). Our memory model is based on texturing hardware, which uses a 2D block-based array representation to perform the underlying computations. We incorporate many characteristics of GPU architectures, including smaller cache sizes and 2D block representations, and use the 3C's model to analyze the cache misses. Moreover, we present techniques to improve the performance of nested loops on GPUs. To demonstrate the effectiveness of our model, we highlight its performance on three memory-intensive scientific applications: sorting, fast Fourier transform, and dense matrix multiplication. Our cache-efficient algorithms for these applications achieve memory throughput of 30-50 GB/s on an NVIDIA 7900 GTX GPU. We also compare our results with prior GPU-based and CPU-based implementations on high-end processors. In practice, we achieve a 2-5x performance improvement.
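
The abstract mentions block-based representations and improving nested loops for cache efficiency. As a rough illustration of that general idea only, the following is a minimal CPU-side sketch of loop tiling (blocking) applied to dense matrix multiplication. It is not the authors' GPU implementation: the function names and the `tile` parameter are hypothetical and chosen for illustration, and the paper's actual block sizes and texture layout are not given on this page.

```python
def matmul_naive(a, b):
    """Reference triple loop: C = A * B for matrices stored as lists of rows."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]
    return c


def matmul_tiled(a, b, tile=2):
    """Same product, computed one tile at a time.

    `tile` is a hypothetical block edge length. The three outer loops walk
    over tile origins; the three inner loops stay inside one tile, so each
    small block of A, B, and C is reused before moving on.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                # min(...) handles edge tiles when sizes are not multiples of `tile`.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c
```

Both functions compute the same product; the tiled version only reorders the iteration space so that accesses cluster within small 2D blocks, which is the kind of locality a small cache can exploit.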

