|
ABSTRACT
The minimal block storage Distributed Square Block Packed (DSBP) format for distributed memory computing on symmetric and triangular matrices is presented. Three algorithm variants (Basic, Static, and Dynamic) of the blocked right-looking Cholesky factorization are designed for the DSBP format, implemented, and evaluated. On our target machine, all variants outperform standard full-storage implementations while saving almost half the storage. Communication overhead is shown to be virtually eliminated by the Static and Dynamic variants, both of which take advantage of hardware parallelism to hide communication costs. The Basic variant is shown to yield comparable or slightly better performance than the full-storage ScaLAPACK routine PDPOTRF while clearly outperformed by both Static and Dynamic. Models of execution assuming zero communication costs and overhead are developed. For medium- and larger-sized problems, the Static schedule is near optimal on our target machine based on comparisons with these models and measurements of synchronization overhead.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
Agarwal, R. C. and Gustavson, F. G. 1988. A parallel implementation of matrix multiplication and LU factorization on the IBM 3090. In Aspects of Computation on Asynchronous and Parallel Processors, M. Wright, Ed. IFIP, North-Holland, Amsterdam, The Netherlands, 217--221.
|
 |
2
|
|
| |
3
|
|
| |
4
|
Baboulin, M., Giraud, L., and Gratton, S. 2005a. A parallel distributed solver for large dense symmetric systems: Applications to geodesy and electromagnetism problems. Int. J. High Perform. Comput. Appl./19, 353--363.
|
| |
5
|
Baboulin, M., Giraud, L., Gratton, S., and Langou, J. 2005b. A distributed packed storage for large parallel calculations. Tech. rep. TR/PA/05/30. CERFACS, Toulouse, France.
|
| |
6
|
|
 |
7
|
|
| |
8
|
Buttari, A., Langou, J., Kurzak, J., and Dongarra, J. 2007. A class of parallel tiled linear algebra algorithms for multicore architectures. Tech. rep. UT-CS-07-600. University of Tennessee, Knoxville, TN.
|
 |
9
|
Ernie Chan , Enrique S. Quintana-Orti , Gregorio Quintana-Orti , Robert van de Geijn, Supermatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures, Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures, June 09-11, 2007, San Diego, California, USA
[doi> 10.1145/1248377.1248397]
|
| |
10
|
Jaeyoung Choi , Jack J. Dongarra , L. Susan Ostrouchov , Antoine P. Petitet , David W. Walker , R. Clint Whaley, Design and Implementation of the ScaLAPACK LU, QR, and Cholesky Factorization Routines, Scientific Programming, v.5 n.3, p.173-184, August 1996
|
| |
11
|
Dackland, K., Elmroth, E., Kågström, B., and Van Loan, C. 1992. Parallel block factorizations on the shared memory multiprocessor IBM 3090 VF/600J. Int. J. Supercomput. Appl. 6.1, 69--97.
|
| |
12
|
Dackland, K., Elmroth, E., and Kågström, B. 1993. A ring--oriented approach for block matrix factorizations on shared and distributed memory architectures. In SIAM Conference on Parallel Processing for Scientific Computing, R. S. et al, Ed. SIAM Publications, Philadelphia, PA, 330--338.
|
| |
13
|
|
| |
14
|
|
| |
15
|
Geist, G. A. and Heath, M. T. 1985. Parallel Cholesky factorization on a hypercube multiprocessor. Tech. rep. ORNL--6190. Oak Ridge National Lab., Oak Ridge, TN.
|
| |
16
|
|
| |
17
|
|
 |
18
|
|
| |
19
|
|
| |
20
|
Gustavson, F. G., Gunnels, J. A., and Sexton, J. C. 2007a. Minimal data copy for dense linear algebra factorization. In PARA 2006: State of the Art in Scientific and Parallel Computing, B. Kågström et al., Eds. Lecture Notes in Computer Science, vol. 4699. Springer, Berlin, Germany, 540--549.
|
| |
21
|
Gustavson, F. G., Karlsson, L., and Kågström, B. 2007b. Three algorithms for Cholesky factorization on distributed memory using packed storage. In PARA 2006: State of the Art in Scientific and Parallel Computing, B. Kågström et al., Eds. Lecture Notes in Computer Science, vol 4699. Springer. Berlin, Germany, 550--559. Also IBM Tech. rep. RC24137.
|
| |
22
|
MPI Forum. 1995. MPI: A message passing interface standard. http://www.mpi-forum.org/.
|
 |
23
|
|
| |
24
|
Strazdins, P. 1998. A comparison of lookahead and algorithmic blocking techniques for parallel matrix factorization. Tech. rep. TR-CS-98-07. Canberra 0200 ACT, Australia.
|
| |
25
|
van de Geijn, R. A. 1997. Using PLAPACK. MIT Press, Cambridge, MA.
|
|