ABSTRACT
We present performance results for dense linear algebra on recent NVIDIA GPUs. Our matrix-matrix multiply routine (GEMM) runs up to 60% faster than the vendor's implementation and approaches the peak of the hardware's capabilities. Our LU, QR and Cholesky factorizations achieve up to 80-90% of the peak GEMM rate. Our parallel LU running on two GPUs achieves up to ~540 Gflop/s. These results are accomplished by challenging the accepted view of the GPU architecture and programming guidelines. We argue that modern GPUs should be viewed as multithreaded multicore vector units. We exploit blocking, as is done on vector computers, and the heterogeneity of the system by computing on both the GPU and the CPU. This study includes detailed benchmarking of the GPU memory system that reveals the sizes and latencies of the caches and the TLB. We also present a couple of algorithmic optimizations, aimed at increasing parallelism and regularity in the problem, that yield slightly higher performance.
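To illustrate why a blocked factorization can run at a large fraction of the GEMM rate, the following is a minimal sketch of a right-looking blocked LU. It is not the implementation evaluated in this work: it runs on the CPU in pure Python, omits pivoting (the test matrix is made diagonally dominant for that reason), and the names `make_test_matrix`, `blocked_lu`, `lu_product` and the block size `nb` are assumptions introduced for the example. It counts how many of the floating-point operations fall in the trailing matrix-matrix update.

```python
"""Right-looking blocked LU (no pivoting), pure Python, for illustration only."""


def make_test_matrix(n):
    """Hypothetical test input: diagonally dominant, so no pivoting is needed."""
    return [[float(n) if i == j else 1.0 / (1 + abs(i - j)) for j in range(n)]
            for i in range(n)]


def blocked_lu(a, nb):
    """Factor `a` in place into unit-lower L and upper U, sharing storage.

    Returns (gemm_flops, total_flops) so the GEMM share can be inspected.
    """
    n = len(a)
    gemm_flops = 0
    other_flops = 0
    for k in range(0, n, nb):
        e = min(k + nb, n)  # the current panel covers columns k..e-1
        # Step 1: unblocked LU of the tall panel (rows k..n-1, columns k..e-1).
        for j in range(k, e):
            for i in range(j + 1, n):
                a[i][j] /= a[j][j]
                for c in range(j + 1, e):
                    a[i][c] -= a[i][j] * a[j][c]
                other_flops += 1 + 2 * (e - j - 1)
        # Step 2: triangular solve, U12 = inv(L11) * A12 (forward substitution).
        for j in range(k, e):
            for i in range(j + 1, e):
                for c in range(e, n):
                    a[i][c] -= a[i][j] * a[j][c]
                other_flops += 2 * (n - e)
        # Step 3: trailing update A22 -= L21 * U12, a matrix-matrix multiply.
        for i in range(e, n):
            for c in range(e, n):
                s = 0.0
                for p in range(k, e):
                    s += a[i][p] * a[p][c]
                a[i][c] -= s
        gemm_flops += 2 * (n - e) * (n - e) * (e - k)
    return gemm_flops, gemm_flops + other_flops


def lu_product(lu):
    """Multiply the packed factors back together to check the factorization."""
    n = len(lu)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for p in range(min(i, j) + 1):
                l_ip = 1.0 if p == i else lu[i][p]  # unit diagonal of L
                s += l_ip * lu[p][j]
            out[i][j] = s
    return out


if __name__ == "__main__":
    n, nb = 48, 8
    a = make_test_matrix(n)
    lu = [row[:] for row in a]
    gemm, total = blocked_lu(lu, nb)
    back = lu_product(lu)
    err = max(abs(back[i][j] - a[i][j]) for i in range(n) for j in range(n))
    print("max reconstruction error:", err)
    print("share of flops in the trailing GEMM update:", gemm / total)
```

In this sketch the share of work in step 3 grows as the matrix size increases relative to `nb`, which is consistent with factorization rates approaching the GEMM rate. On a hybrid system, one plausible split (an assumption here, not a description of the authors' code) is to keep the narrow panel of step 1 on the CPU and run the large update of step 3 on the GPU.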