|
ABSTRACT
We present CALU, a Communication Avoiding algorithm for the LU factorization of dense matrices distributed in a two-dimensional cyclic layout. The algorithm is based on a new pivoting strategy, which is stable in practice. The new algorithm is optimal (up to polylogarithmic factors) in the amount of communication it performs. Our experiments show that CALU leads to a reduction in the parallel time, in particular when the latency time is an important factor of the overall time. The factorization of a block-column, a subroutime of CALU, outperforms the corresponding routine PDGETF2 from ScaLAPACK up to a factor of 4.37 on an IBM POWER 5 system and up to a factor of 5.58 on a Cray XT4 system. On square matrices of order 104, CALU outperforms the corresponding routine PDGETRF from ScaLAPACK by a factor of 1.24 on IBM POWER 5 and by a factor of 1.31 on Cray XT4.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
Top 500 supercomputer sites. available at www.top500.org.
|
| |
2
|
M. Baboulin, J. Dongarra, and S. Tomov. Some Issues in Dense Linear Algebra for Multicore and Special Purpose Architectures. Technical Report UT-CS-08-615, University of Tennessee, 2008. LAPACK Working Note 200.
|
| |
3
|
A. Buttari, J. Langou, J. Kurzak, and J. Dongarra. A class of parallel tiled linear algebra algorithms for multicore architectures. Technical Report UT-CS-07-600, University of Tennessee, 2007. LAPACK Working Note 191.
|
| |
4
|
Jaeyoung Choi , Jack J. Dongarra , L. Susan Ostrouchov , Antoine P. Petitet , David W. Walker , R. Clint Whaley, Design and Implementation of the ScaLAPACK LU, QR, and Cholesky Factorization Routines, Scientific Programming, v.5 n.3, p.173-184, August 1996
|
| |
5
|
J. Demmel, L. Grigori, M. Hoemmen, and J. Langou. Communication-optimal parallel and sequential QR and LU factorizations. Technical Report UCB/EECS-2008-89, UC Berkeley, 2008. LAPACK Working Note 204.
|
| |
6
|
J. J. Dongarra, P. Luszczek, and A. Petitet. The LIN-PACK Benchmark: Past, Present and Future. Concurrency: Practice and Experience, 15:803--820, 2003.
|
| |
7
|
|
| |
8
|
S. L. Graham, M. Snir, and C. A. Patterson, editors. Getting Up To Speed: The Future Of Supercomputing. National Academies Press, Washington, D.C., USA, 2005.
|
| |
9
|
|
| |
10
|
G. Quintana-Orti, E. S. Quiutana-Orti, E. Chan, F. G. Van Zee, and R. van de Geijn. Programming algorithms-by-blocks for matrix computations on multithreaded architectures. Technical Report TR-08-04, University of Texas at Austin, 2008. FLAME Working Note 29.
|
| |
11
|
D. Skinner. IBM SP Parallel Scaling Overview. http://www.nersc.gov/news/reports/technical/seaborg_scaling.
|
| |
12
|
A. Tate. Personal communication.
|
| |
13
|
|
| |
14
|
|
| |
15
|
|
CITED BY
|
|
Grey Ballard , James Demmel , Olga Holtz , Oded Schwartz, Communication-optimal parallel and sequential Cholesky decomposition: extended abstract, Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures, August 11-13, 2009, Calgary, AB, Canada
|
|