ACM Home Page
Please provide us with feedback. Feedback
QR decomposition on GPUs
Full text PdfPdf (884 KB)
Source ACM International Conference Proceeding Series; Vol. 383 archive
Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units table of contents
Washington, D.C.
Pages 71-78  
Year of Publication: 2009
ISBN:978-1-60558-517-8
Authors
Andrew Kerr  Georgia Tech Research Institute
Dan Campbell  Georgia Tech Research Institute
Mark Richards  Georgia Tech Research Institute
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 68,   Downloads (12 Months): 249,   Citation Count: 0
Additional Information:

abstract   references   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1513895.1513904
What is a DOI?

ABSTRACT

QR decomposition is a computationally intensive linear algebra operation that factors a matrix A into the product of a unitary matrix Q and upper triangular matrix R. Adaptive systems commonly employ QR decomposition to solve overdetermined least squares problems. Performance of QR decomposition is typically the crucial factor limiting problem sizes.

Graphics Processing Units (GPUs) are high-performance processors capable of executing hundreds of floating point operations in parallel. As commodity accelerators for 3D graphics, GPUs offer tremendous computational performance at relatively low costs. While GPUs are favorable to applications with much inherent parallelism requiring coarse-grain synchronization between processors, methods for efficiently utilizing GPUs for algorithms computing QR decomposition remain elusive.

In this paper, we discuss the architectural characteristics of GPUs and explain how a high-performance implementation of QR decomposition may be implemented. We provide detailed performance analysis of the resulting implementation for real-valued matrices and offer recommendations for achieving high performance to future developers of dense linear algebra procedures for GPUs. Our implementation sustains 143 GFLOP/s, and we believe this is the fastest announced QR implementation executing entirely on the GPU.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
NVIDIA Corporation, Santa Clara, California, NVIDIA CUDA Compute Unified Device Architecture, 2008.
 
2
Khronos OpenCL Working Group, The OpenCL Specification, 2008.
 
3
A. Kerr, D. Campbell, and M. Richards, GPU Performance Assessment with the HPEC Challenge, in HPEC Workshop 2008, Lexington, MA, 2008, MIT Lincoln Laboratory.
 
4
R. Haney, T. Meuse, J. Kepner, and J. Lebak, HPEC Challenge Overview, MIT Lincoln Laboratory, 2005.
 
5
 
6
C. H. Bischof and C. V. Loan, The WY Representation for Products of Householder Matrices, Cornell University, Ithaca, NY, USA, 1985.
7
8
 
9
H. Hoffmann, Stream Algorithms and Architecture, Master's thesis, Massachusetts Institute of Technology, 2003.
 
10
NVIDIA, CUDA CUBLAS Library, NVIDIA Corporation, Santa Clara, California, 2008.
 
11
 
12
A. Kerr, D. Campbell, and M. Richards, GPU VSIPL, in HPEC Workshop 2008, Lexington, MA, 2008, MIT Lincoln Laboratory.
 
13
M. Baboulin, J. Dongarra, and S. Tomov, Some Issues in Dense Linear Algebra for Multicore and Special Purpose Architectures, Technical Report UT-CS-08-200, University of Tennessee, 2008.
 
14
D. A. Schwartz, R. R. Judd, W. J. Harrod, and D. P. Manley, VSIPL 1.3 API, VSIPL Forum, 2008.
Collaborative Colleagues:
Andrew Kerr: colleagues
Dan Campbell: colleagues
Mark Richards: colleagues