ACM Home Page
Please provide us with feedback. Feedback
Communication-optimal parallel and sequential Cholesky decomposition: extended abstract
Full text PdfPdf (447 KB)
Source
ACM Symposium on Parallel Algorithms and Architectures archive
Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures table of contents
Calgary, AB, Canada
SESSION: High-performance parallel computation table of contents
Pages 245-252  
Year of Publication: 2009
ISBN:978-1-60558-606-9
Authors
Grey Ballard  University of California, Berkeley, CA, USA
James Demmel  University of California, Berkeley, CA, USA
Olga Holtz  University of California, Berkeley, CA, USA
Oded Schwartz  Technische Universitaet Berlin, Berlin, Germany
Sponsors
SIGOPS: ACM Special Interest Group on Operating Systems
ACM: Association for Computing Machinery
SIGACT: ACM Special Interest Group on Algorithms and Computation Theory
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 12,   Downloads (12 Months): 40,   Citation Count: 1
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1583991.1584054
What is a DOI?

ABSTRACT

Numerical algorithms have two kinds of costs: arithmetic and communication, by which we mean either moving data between levels of a memory hierarchy (in the sequential case) or over a network connecting processors (in the parallel case). Communication costs often dominate arithmetic costs, so it is of interest to design algorithms minimizing communication. In this paper we first extend known lower bounds on the communication cost (both for bandwidth and for latency) of conventional (O(n3)) matrix multiplication to Cholesky, which is used for solving dense symmetric positive definite linear systems. Second, we compare the cost of various Cholesky implementations to this lower bound, and draw the following conclusions:

(1) "Naïve" sequential algorithms for Cholesky attain neither the bandwidth nor latency lower bounds.

(2) The sequential blocked algorithm in LAPACK (with the right block size), as well as various recursive algorithms [AP00, GJ01, AGW01, ST04], and one based on work of Toledo [Tol97], can attain the bandwidth lower bound.

(3) The LAPACK algorithm can also attain the latency bound if used with blocked data structures rather than column-wise or row-wise matrix data structures, though the Toledo algorithm cannot.

(4) The recursive sequential algorithm due to [AP00] attains the bandwidth and latency lower bounds at every level of a multi-level memory hierarchy, in a "cache-oblivious" way.

(5) The parallel implementation of Cholesky in the ScaLAPACK library (again with the right block-size) attains both the bandwidth and latency lower bounds to within a poly-logarithmic factor.

Combined with prior results in [DGHL08a, DGHL08b, DGX08] this gives a complete set of communication-optimal algorithms for O(n3) implementations of three basic factorizations of dense linear algebra: LU with pivoting, QR and Cholesky. But it goes beyond this prior work on sequential LU and QR by optimizing communication for any number of levels of memory hierarchy.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
2
 
3
4
 
5
N. Béreux. Out-of-core implementations of cholesky factorization: Loop-based versus recursive algorithms. SIAM Journal on Matrix Analysis and Applications, 30(4):1302--1319, 2008.
6
7
 
8
9
 
10
J. Demmel, L. Grigori, M. Hoemmen, and J. Langou. Communication-optimal parallel and sequential QR and LU factorizations. UC Berkeley Technical Report EECS-2008-89, Aug 1, 2008; Submitted to SIAM. J. Sci. Comp., 2008.
 
11
J. Demmel, L. Grigori, M. Hoemmen, and J. Langou. Implementing communication-optimal parallel and sequential QR and LU factorizations. submitted to SIAM. J. Sci. Comp., 2008.
 
12
 
13
E. Elmroth, F. G. Gustavson, I. Jonsson, and B. Kågström. Recursive blocked algorithms and hybrid data structures for dense matrix library software. 46(1):3--45, March 2004.
 
14
 
15
16
 
17
IEEE standard for floating-point arithmetic. IEEE Std. 754-2008, pages 1--58, 29 2008.
 
18
 
19
 
20


Collaborative Colleagues:
Grey Ballard: colleagues
James Demmel: colleagues
Olga Holtz: colleagues
Oded Schwartz: colleagues