|
ABSTRACT
With the emergence of thread-level parallelism as the primary means for continued performance improvement, the programmability issue has reemerged as an obstacle to the use of architectural advances. We argue that evolving legacy libraries for dense and banded linear algebra is not a viable solution due to constraints imposed by early design decisions. We propose a philosophy of abstraction and separation of concerns that provides a promising solution in this problem domain. The first abstraction, FLASH, allows algorithms to express computation with matrices consisting of contiguous blocks, facilitating algorithms-by-blocks. Operand descriptions are registered for a particular operation a priori by the library implementor. A runtime system, SuperMatrix, uses this information to identify data dependencies between suboperations, allowing them to be scheduled to threads out-of-order and executed in parallel. But not all classical algorithms in linear algebra lend themselves to conversion to algorithms-by-blocks. We show how our recently proposed LU factorization with incremental pivoting and a closely related algorithm-by-blocks for the QR factorization, both originally designed for out-of-core computation, overcome this difficulty. Anecdotal evidence regarding the development of routines with a core functionality demonstrates how the methodology supports high productivity while experimental results suggest that high performance is abundantly achievable.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
 |
2
|
|
| |
3
|
E. Anderson , Z. Bai , C. Bischof , L. S. Blackford , J. Demmel , Jack J. Dongarra , J. Du Croz , S. Hammarling , A. Greenbaum , A. McKenney , D. Sorensen, LAPACK Users' guide (third ed.), Society for Industrial and Applied Mathematics, Philadelphia, PA, 1999
|
| |
4
|
Balay, S., Buschelman, K., Eijkhout, V., Gropp, W. D., Kaushik, D., Knepley, M. G., McInnes, L. C., Smith, B. F., and Zhang, H. 2004. PETSc users manual. Tech. rep. ANL-95/11—Revision 2.1.5, Argonne National Laboratory, Argonne.
|
 |
5
|
|
 |
6
|
|
| |
7
|
Buttari, A., Langou, J., Kurzak, J., and Dongarra, J. 2007. A class of parallel tiled linear algebra algorithms for multicore architectures. LAPACK Working Note 191 UT-CS-07-600. University of Knoxville.
|
| |
8
|
|
 |
9
|
Ernie Chan , Enrique S. Quintana-Orti , Gregorio Quintana-Orti , Robert van de Geijn, Supermatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures, Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures, June 09-11, 2007, San Diego, California, USA
[doi> 10.1145/1248377.1248397]
|
 |
10
|
Ernie Chan , Field G. Van Zee , Paolo Bientinesi , Enrique S. Quintana-Orti , Gregorio Quintana-Orti , Robert van de Geijn, SuperMatrix: a multithreaded runtime scheduling system for algorithms-by-blocks, Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming, February 20-23, 2008, Salt Lake City, UT, USA
[doi> 10.1145/1345206.1345227]
|
| |
11
|
|
| |
12
|
Choi, J., Dongarra, J. J., Pozo, R., and Walker, D. W. 1992. ScaLAPACK: A scalable linear algebra library for distributed memory concurrent computers. In Proceedings of the 4th Symposium on the Frontiers of Massively Parallel Computation, McLean, 120--127.
|
| |
13
|
|
| |
14
|
Dongarra, J. J., Bunch, J. R., Moler, C. B., and Stewart, G. W. 1979. LINPACK Users' Guide. SIAM, Philadelphia.
|
 |
15
|
|
 |
16
|
|
| |
17
|
Edwards, H. C. and van de Geijn, R. A. 2006. On application interfaces to parallel dense matrix libraries: Just let me solve my problem! FLAME Working Note #18 TR-2006-15. Department of Computer Sciences, University of Texas at Austin.
|
| |
18
|
Elmroth, E., Gustavson, F., Jonsson, I., and Kagstrom, B. 2004. Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Rev. 46, 1, 3--45.
|
| |
19
|
|
 |
20
|
|
| |
21
|
Gropp, W., Lusk, E., and Skjellum, A. 1994. Using MPI. MIT Press, Cambridge.
|
 |
22
|
|
 |
23
|
|
 |
24
|
Jia Guo , Ganesh Bikshandi , Basilio B. Fraguela , Maria J. Garzaran , David Padua, Programming with tiles, Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming, February 20-23, 2008, Salt Lake City, UT, USA
[doi> 10.1145/1345206.1345225]
|
| |
25
|
Gustavson, F. G., Karlsson, L., and Kagstrom, B. 2007. Three algorithms for Cholesky factorization on distributed memory using packed storage. In Proceedings of the Workshop on State-of-the-Art in Scientific Computing. Lecture Notes in Computer Science, vol. 4699. Springer, Berlin/Heidelberg, Germany, 550--559.
|
| |
26
|
Herrero, J. R. 2006. A framework for efficient execution of matrix computations. Ph.D. dissertation. Polytechnic University of Catalonia, Barcelona, Spain.
|
| |
27
|
Joffrain, T., Quintana-Ortí, E. S., and van de Geijn, R. A. 2004. Rapid development of high-performance out-of-core solvers. In Proceedings of the Workshop on State-of-the-Art in Scientific Computing. Lecture Notes in Computer Science, vol. 3732. Springer, Berlin/Heidelberg, Germany, 413--422.
|
| |
28
|
Kurzak, J. and Dongarra, J. 2006. Implementing linear algebra routines on multi-core processors with pipelining and a look ahead. LAPACK Working Note 178 UT-CS-06-581. University of Tennessee, Knoxville.
|
 |
29
|
|
| |
30
|
Low, T. M. and van de Geijn, R. 2004. An API for manipulating matrices stored by blocks. FLAME Working Note #12 TR-2004-15. Department of Computer Sciences, University of Texas at Austin, Austin.
|
 |
31
|
Honghui Lu , Alan L. Cox , Sandhya Dwarkadas , Ramakrishnan Rajamony , Willy Zwaenepoel, Compiler and software distributed shared memory support for irregular applications, Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming, p.48-56, June 18-21, 1997, Las Vegas, Nevada, United States
|
| |
32
|
Marker, B. A., Van Zee, F. G., Goto, K., Quintana-Ortí, G., and van de Geijn, R. A. 2007. Toward scalable matrix multiply on multithreaded architectures. In Proceedings of the 13th International European Conference on Parallel and Distributed Computing (Rennes, France). 748--757.
|
| |
33
|
|
 |
34
|
|
| |
35
|
Quintana-Ortí, G., Quintana-Ortí, E. S., Chan, E., van de Geijn, R., and Van Zee, F. G. 2008a. Design of scalable dense linear algebra libraries for multithreaded architectures: The LU factorization. In Proceedings of the Workshop on Multithreaded Architectures and Applications, Miami, 1--8.
|
| |
36
|
Gregorio Quintana-Orti , Enrique S. Quintana-Orti , Ernie Chan , Robert A. van de Geijn , Field G. Van Zee, Scheduling of QR Factorization Algorithms on SMP and Multi-Core Architectures, Proceedings of the 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP 2008), p.301-310, February 13-15, 2008
[doi> 10.1109/PDP.2008.37]
|
| |
37
|
Gregorio Quintana-Ortí , Enrique S. Quintana-Ortí , Alfredo Remón , Robert A. Geijn, An Algorithm-by-Blocks for SuperMatrix Band Cholesky Factorization, High Performance Computing for Computational Science - VECPAR 2008: 8th International Conference, Toulouse, France, June 24-27, 2008. Revised Selected Papers, Springer-Verlag, Berlin, Heidelberg, 2008
[doi> 10.1007/978-3-540-92859-1_21]
|
| |
38
|
Skagestein, G. 1972. Rekursiv unterteilte matrizen sowie methoden zur erstellung von rechnerprogrammen fur ihre verarbeitung. Ph.D. dissertation. Universität Stuttgart, Stuttgart, Germany.
|
| |
39
|
|
| |
40
|
Strazdins, P. 2001. A comparison of lookahead and algorithmic blocking techniques for parallel matrix factorization. Int. J. Parall. Distrib. Syst. Netw. 4, 1, 26--35.
|
| |
41
|
|
| |
42
|
Valsalam, V. and Skjellum, A. 2002. A framework for high-performance matrix multiplication based on hierarchical abstractions, algorithms and optimized low-level kernels. Concurr. Computat. Pract. Exper. 14, 10, 805--840.
|
| |
43
|
|
| |
44
|
van de Geijn, R. A. and Quintana-Ortí, E. S. 2008. The Science of Programming Matrix Computations. www.lulu.com.
|
| |
45
|
|
 |
46
|
David S. Wise , Jeremy D. Frens , Yuhong Gu , Gregory A. Alexander, Language support for Morton-order matrices, Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming, p.24-33, June 2001, Snowbird, Utah, United States
|
|