ACM Home Page
Please provide us with feedback. Feedback
Distributed SBP Cholesky factorization algorithms with near-optimal scheduling
Full text PdfPdf (719 KB)
Source
ACM Transactions on Mathematical Software (TOMS) archive
Volume 36 ,  Issue 2  (March 2009) table of contents
Article No. 11  
Year of Publication: 2009
ISSN:0098-3500
Authors
Fred G. Gustavson  IBM T.J. Watson Research Center, Yorktown Heights, NY and Umeå University, Umeå, Sweden
Lars Karlsson  Umeå University, Umeå, Sweden
Bo Kågström  Umeå University, Umeå, Sweden
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 16,   Downloads (12 Months): 72,   Citation Count: 0
Additional Information:

abstract   references   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1499096.1499100
What is a DOI?

ABSTRACT

The minimal block storage Distributed Square Block Packed (DSBP) format for distributed memory computing on symmetric and triangular matrices is presented. Three algorithm variants (Basic, Static, and Dynamic) of the blocked right-looking Cholesky factorization are designed for the DSBP format, implemented, and evaluated. On our target machine, all variants outperform standard full-storage implementations while saving almost half the storage. Communication overhead is shown to be virtually eliminated by the Static and Dynamic variants, both of which take advantage of hardware parallelism to hide communication costs. The Basic variant is shown to yield comparable or slightly better performance than the full-storage ScaLAPACK routine PDPOTRF while clearly outperformed by both Static and Dynamic. Models of execution assuming zero communication costs and overhead are developed. For medium- and larger-sized problems, the Static schedule is near optimal on our target machine based on comparisons with these models and measurements of synchronization overhead.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
Agarwal, R. C. and Gustavson, F. G. 1988. A parallel implementation of matrix multiplication and LU factorization on the IBM 3090. In Aspects of Computation on Asynchronous and Parallel Processors, M. Wright, Ed. IFIP, North-Holland, Amsterdam, The Netherlands, 217--221.
2
 
3
 
4
Baboulin, M., Giraud, L., and Gratton, S. 2005a. A parallel distributed solver for large dense symmetric systems: Applications to geodesy and electromagnetism problems. Int. J. High Perform. Comput. Appl./19, 353--363.
 
5
Baboulin, M., Giraud, L., Gratton, S., and Langou, J. 2005b. A distributed packed storage for large parallel calculations. Tech. rep. TR/PA/05/30. CERFACS, Toulouse, France.
 
6
7
 
8
Buttari, A., Langou, J., Kurzak, J., and Dongarra, J. 2007. A class of parallel tiled linear algebra algorithms for multicore architectures. Tech. rep. UT-CS-07-600. University of Tennessee, Knoxville, TN.
9
 
10
 
11
Dackland, K., Elmroth, E., Kågström, B., and Van Loan, C. 1992. Parallel block factorizations on the shared memory multiprocessor IBM 3090 VF/600J. Int. J. Supercomput. Appl. 6.1, 69--97.
 
12
Dackland, K., Elmroth, E., and Kågström, B. 1993. A ring--oriented approach for block matrix factorizations on shared and distributed memory architectures. In SIAM Conference on Parallel Processing for Scientific Computing, R. S. et al, Ed. SIAM Publications, Philadelphia, PA, 330--338.
 
13
 
14
 
15
Geist, G. A. and Heath, M. T. 1985. Parallel Cholesky factorization on a hypercube multiprocessor. Tech. rep. ORNL--6190. Oak Ridge National Lab., Oak Ridge, TN.
 
16
 
17
18
 
19
 
20
Gustavson, F. G., Gunnels, J. A., and Sexton, J. C. 2007a. Minimal data copy for dense linear algebra factorization. In PARA 2006: State of the Art in Scientific and Parallel Computing, B. Kågström et al., Eds. Lecture Notes in Computer Science, vol. 4699. Springer, Berlin, Germany, 540--549.
 
21
Gustavson, F. G., Karlsson, L., and Kågström, B. 2007b. Three algorithms for Cholesky factorization on distributed memory using packed storage. In PARA 2006: State of the Art in Scientific and Parallel Computing, B. Kågström et al., Eds. Lecture Notes in Computer Science, vol 4699. Springer. Berlin, Germany, 550--559. Also IBM Tech. rep. RC24137.
 
22
MPI Forum. 1995. MPI: A message passing interface standard. http://www.mpi-forum.org/.
23
 
24
Strazdins, P. 1998. A comparison of lookahead and algorithmic blocking techniques for parallel matrix factorization. Tech. rep. TR-CS-98-07. Canberra 0200 ACT, Australia.
 
25
van de Geijn, R. A. 1997. Using PLAPACK. MIT Press, Cambridge, MA.

Collaborative Colleagues:
Fred G. Gustavson: colleagues
Lars Karlsson: colleagues
Bo Kågström: colleagues