Benchmarking GPUs to tune dense linear algebra
Source: Conference on High Performance Networking and Computing
Proceedings of the 2008 ACM/IEEE conference on Supercomputing - Volume 00
Austin, Texas
SECTION: Papers
Article No. 31  
Year of Publication: 2008
ISBN:978-1-4244-2835-9
Authors
Vasily Volkov  University of California at Berkeley
James W. Demmel  University of California at Berkeley
Publisher
IEEE Press  Piscataway, NJ, USA
Bibliometrics
Downloads (6 Weeks): 94,   Downloads (12 Months): 916,   Citation Count: 8

ABSTRACT

We present performance results for dense linear algebra using recent NVIDIA GPUs. Our matrix-matrix multiply routine (GEMM) runs up to 60% faster than the vendor's implementation and approaches the peak of hardware capabilities. Our LU, QR and Cholesky factorizations achieve up to 80-90% of the peak GEMM rate. Our parallel LU running on two GPUs achieves up to ~540 Gflop/s. These results are accomplished by challenging the accepted view of the GPU architecture and programming guidelines. We argue that modern GPUs should be viewed as multithreaded multicore vector units. We exploit blocking similarly to vector computers and heterogeneity of the system by computing both on GPU and CPU. This study includes detailed benchmarking of the GPU memory system that reveals sizes and latencies of caches and TLB. We present a couple of algorithmic optimizations aimed at increasing parallelism and regularity in the problem that provide us with slightly higher performance.
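
As a rough illustration of the blocking idea mentioned in the abstract, the sketch below multiplies two matrices tile by tile, so that each small tile of the operands is reused several times before the loops move on. This is a plain-Python CPU illustration only, not the authors' GPU kernel: the names matmul_naive and matmul_blocked and the block parameter are hypothetical, and the tile size of 2 is arbitrary.

```python
def matmul_naive(a, b):
    """Reference triple-loop product of an n x k and a k x m matrix."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += a[i][p] * b[p][j]
            c[i][j] = s
    return c


def matmul_blocked(a, b, block=2):
    """Same product, computed one block x block tile at a time.

    'block' is an assumed tuning parameter; a real implementation would
    pick it to match the fast memory available (registers, shared memory).
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, block):          # tile rows of C
        for j0 in range(0, m, block):      # tile columns of C
            for p0 in range(0, k, block):  # tiles along the inner dimension
                for i in range(i0, min(i0 + block, n)):
                    for j in range(j0, min(j0 + block, m)):
                        s = c[i][j]
                        for p in range(p0, min(p0 + block, k)):
                            s += a[i][p] * b[p][j]
                        c[i][j] = s
    return c
```

Both functions return the same result; the blocked version only changes the order in which the work is visited, which is what allows data held in fast memory to be reused.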

