ABSTRACT
This paper describes an extension to the set of Basic Linear Algebra Subprograms. The extensions are targeted at matrix-matrix operations that should provide for efficient and portable implementations of algorithms for high-performance computers.
REFERENCES
1. BARRON, D. W., AND SWINNERTON-DYER, H. P. F. Solution of simultaneous linear equations using a magnetic-tape store. Comput. J. 3 (1960), 28-33.
2. BERRY, M., GALLIVAN, K., HARROD, W., JALBY, W., LO, S., MEIER, U., PHILIPPE, B., AND SAMEH, A. Parallel algorithms on the CEDAR system. CSRD Report 581, 1986.
4. BRONLUND, O. E., AND JOHNSEN, T. QR-factorization of partitioned matrices. Comput. Meth. Appl. Mech. Eng. 3 (1974), 153-172.
5. BUCHER, I., AND JORDAN, T. Linear algebra programs for use on a vector computer with a secondary solid state storage device. In Advances in Computer Methods for Partial Differential Equations, R. Vichnevetsky and R. Stepleman, Eds. IMACS, 1984, 546-550.
6. CALAHAN, D. A. Block-oriented local-memory-based linear equation solution on the CRAY-2: Uniprocessor algorithms. In Proceedings International Conference on Parallel Processing (Aug. 1986). IEEE Computer Society Press, New York, 1986.
7. CARNEVALI, P., RADICATI DI BROZOLO, G., ROBERT, Y., AND SGUAZZERO, P. Efficient Fortran implementation of the Gaussian elimination and Householder reduction algorithms on the IBM 3090 vector multiprocessor. IBM ECSEC Rep. ICE-0012, 1987.
8. CHARTRES, B. Adaption of the Jacobi and Givens methods for a computer with magnetic tape backup store. Univ. of Sydney Tech. Rep. 8, 1960.
9. DAVE, A. K., AND DUFF, I. S. Sparse matrix calculations on the CRAY-2. Parallel Comput. 5 (July 1987), 55-64.
10. DEMMEL, J., DONGARRA, J. J., DU CROZ, J., GREENBAUM, A., HAMMARLING, S., AND SORENSEN, D. Prospectus for the development of a linear algebra library for high-performance computers. Argonne National Lab. Rep. ANL-MCS-TM-97, Sept. 1987.
11. DIETRICH, G. A new formulation of the hypermatrix Householder QR-decomposition. Comput. Meth. Appl. Mech. Eng. 9 (1976), 273-280.
13. DONGARRA, J. J., BUNCH, J., MOLER, C., AND STEWART, G. LINPACK Users' Guide. SIAM, Philadelphia, Pa., 1979.
18. DONGARRA, J. J., GUSTAVSON, F., AND KARP, A. Implementing linear algebra algorithms for dense matrices on a vector pipeline machine. SIAM Rev. 26, 1 (1984), 91-112.
19. DONGARRA, J. J., HAMMARLING, S., AND SORENSEN, D. C. Block reduction of matrices to condensed forms for eigenvalue computations. Argonne National Lab. Rep. ANL-MCS-TM-99, Sept. 1987.
20. DONGARRA, J. J., AND HEWITT, T. Implementing dense linear algebra using multitasking on the CRAY X-MP-4. J. Comput. Appl. Math. 27 (1989), 215-227.
21. DONGARRA, J. J., AND SORENSEN, D. C. Linear algebra on high-performance computers. In Proceedings Parallel Computing 85, U. Schendel, Ed. North Holland, Amsterdam, 1986, 113-136.
23. DUFF, I. S. Full matrix techniques in sparse Gaussian elimination. In Numerical Analysis Proceedings, Dundee 1981, Lecture Notes in Mathematics 912. Springer-Verlag, New York, 1981, 71-84.
25. GEORGE, A., AND RASHWAN, S. Auxiliary storage methods for solving finite element systems. SIAM J. Sci. Stat. Comput. 6, 4 (Oct. 1985), 882-910.
26. IBM. Engineering and scientific subroutine library. Program 5668-863, 1986.
30. ROBERT, Y., AND SGUAZZERO, P. The LU decomposition algorithm and its efficient Fortran implementation on the IBM 3090 vector multiprocessor. IBM ECSEC Rep. ICE-0006, 1987.
31. SCHREIBER, R. Module design specification (Version 1.0). SAXPY Computer Corp., 255 San Geronimo Way, Sunnyvale, CA 94086, 1986.
CITED BY 196
G. von Laszewski, M. Parashar, A. G. Mohamed, G. C. Fox, On the parallelization of blocked LU factorization algorithms on distributed memory architectures, Proceedings of the 1992 ACM/IEEE conference on Supercomputing, p.170-179, November 16-20, 1992, Minneapolis, Minnesota, United States
G.-S. Karamanos, C. Evangelinos, R. C. Boes, R. M. Kirby, G. E. Karniadakis, Direct numerical simulation of turbulence with a PC/linux cluster: fact or fiction?, Proceedings of the 1999 ACM/IEEE conference on Supercomputing (CDROM), p.53-es, November 14-19, 1999, Portland, Oregon, United States
E. Anderson, Z. Bai, J. Dongarra, A. Greenbaum, A. McKenney, J. Du Croz, S. Hammarling, J. Demmel, C. Bischof, D. Sorensen, LAPACK: a portable linear algebra library for high-performance computers, Proceedings of the 1990 conference on Supercomputing, p.2-11, October 1990, New York, New York, United States
D. L. Dai, S. K. S. Gupta, S. D. Kaushik, J. H. Lu, R. V. Singh, C.-H. Huang, P. Sadayappan, R. W. Johnson, EXTENT: a portable programming environment for designing and implementing high-performance block recursive algorithms, Proceedings of the 1994 conference on Supercomputing, p.49-58, December 1994, Washington, D.C., United States
Anurag Acharya, Mustafa Uysal, Robert Bennett, Assaf Mendelson, Michael Beynon, Jeff Hollingsworth, Joel Saltz, Alan Sussman, Tuning the performance of I/O-intensive parallel applications, Proceedings of the fourth workshop on I/O in parallel and distributed systems: part of the federated computing research conference, p.15-27, May 27-27, 1996, Philadelphia, Pennsylvania, United States
Nawaaz Ahmed, Nikolay Mateev, Keshav Pingali, A framework for sparse matrix code synthesis from high-level specifications, Proceedings of the 2000 ACM/IEEE conference on Supercomputing (CDROM), p.58-es, November 04-10, 2000, Dallas, Texas, United States
Steven Huss-Lederman, Elaine M. Jacobson, Anna Tsao, Thomas Turnbull, Jeremy R. Johnson, Implementation of Strassen's algorithm for matrix multiplication, Proceedings of the 1996 ACM/IEEE conference on Supercomputing (CDROM), p.32-es, January 01-01, 1996, Pittsburgh, Pennsylvania, United States
L. S. Blackford, A. Cleary, A. Petitet, R. C. Whaley, J. Demmel, I. Dhillon, H. Ren, K. Stanley, J. Dongarra, S. Hammarling, Practical experience in the numerical dangers of heterogeneous computing, ACM Transactions on Mathematical Software (TOMS), v.23 n.2, p.133-147, June 1997
Philip Alpatov, Greg Baker, Carter Edwards, John Gunnels, Greg Morrow, James Overfelt, Robert van de Geijn, Yuan-Jye J. Wu, PLAPACK: parallel linear algebra package design overview, Proceedings of the 1997 ACM/IEEE conference on Supercomputing (CDROM), p.1-16, November 15-21, 1997, San Jose, CA
Siddhartha Chatterjee, Alvin R. Lebeck, Praveen K. Patnala, Mithuna Thottethodi, Recursive array layouts and fast parallel matrix multiplication, Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures, p.222-231, June 27-30, 1999, Saint Malo, France
Siddhartha Chatterjee, Vibhor V. Jain, Alvin R. Lebeck, Shyam Mundhra, Mithuna Thottethodi, Nonlinear array layouts for hierarchical memory systems, Proceedings of the 13th international conference on Supercomputing, p.444-453, June 20-25, 1999, Rhodes, Greece
Xiaoye S. Li, James W. Demmel, David H. Bailey, Greg Henry, Yozo Hida, Jimmy Iskandar, William Kahan, Suh Y. Kang, Anil Kapur, Michael C. Martin, Brandon J. Thompson, Teresa Tung, Daniel J. Yoo, Design, implementation and testing of extended and mixed precision BLAS, ACM Transactions on Mathematical Software (TOMS), v.28 n.2, p.152-205, June 2002
Anshul Gupta, Fred G. Gustavson, Mahesh Joshi, Sivan Toledo, The design, implementation, and evaluation of a symmetric banded linear solver for distributed-memory parallel computers, ACM Transactions on Mathematical Software (TOMS), v.24 n.1, p.74-101, March 1998
Sally A. McKee, Assaji Aluwihare, Benjamin H. Clark, Robert H. Klenke, Trevor C. Landon, Christopher W. Oliver, Maximo H. Salinas, Adam E. Szymkowiak, Kenneth L. Wright, Wm. A. Wulf, James H. Aylor, Design and evaluation of dynamic access ordering hardware, Proceedings of the 10th international conference on Supercomputing, p.125-132, May 25-28, 1996, Philadelphia, Pennsylvania, United States
Yong Dou, S. Vassiliadis, G. K. Kuzmanov, G. N. Gaydadjiev, 64-bit floating-point FPGA matrix multiplication, Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays, February 20-22, 2005, Monterey, California, USA
D. L. Dai, S. K. S. Gupta, S. D. Kaushik, J. H. Lu, R. V. Singh, C. H. Huang, P. Sadayappan, R. W. Johnson, EXTENT: a portable programming environment for designing and implementing high-performance block recursive algorithms, Proceedings of the 1994 ACM/IEEE conference on Supercomputing, November 14-18, 1994, Washington, D.C.
Jack Dongarra, Ian Foster, Geoffrey Fox, William Gropp, Ken Kennedy, Linda Torczon, Andy White, References, Sourcebook of parallel computing, Morgan Kaufmann Publishers Inc., San Francisco, CA, 2003
Sivan Toledo, Fred G. Gustavson, The design and implementation of SOLAR, a portable library for scalable out-of-core linear algebra computations, Proceedings of the fourth workshop on I/O in parallel and distributed systems: part of the federated computing research conference, p.28-40, May 27-27, 1996, Philadelphia, Pennsylvania, United States
Nikita Kojekine, Vladimir Savchenko, Ichiro Hagiwara, Surface reconstruction based on compactly supported radial basis functions, Geometric modeling: techniques, applications, systems and tools, Kluwer Academic Publishers, Norwell, MA, 2004
Mahmut Kandemir, Alok Choudhary, J. Ramanujam, Meenakshi A. Kandaswamy, A Unified Framework for Optimizing Locality, Parallelism, and Communication in Out-of-Core Computations, IEEE Transactions on Parallel and Distributed Systems, v.11 n.7, p.648-668, July 2000
Laura Susan Blackford, J. Choi, A. Cleary, A. Petitet, R. C. Whaley, J. Demmel, I. Dhillon, K. Stanley, J. Dongarra, S. Hammarling, G. Henry, D. Walker, ScaLAPACK: a portable linear algebra library for distributed memory computers - design issues and performance, Proceedings of the 1996 ACM/IEEE conference on Supercomputing (CDROM), p.5-es, January 01-01, 1996, Pittsburgh, Pennsylvania, United States
Sally A. McKee, William A. Wulf, James H. Aylor, Maximo H. Salinas, Robert H. Klenke, Sung I. Hong, Dee A. B. Weikle, Dynamic Access Ordering for Streamed Computations, IEEE Transactions on Computers, v.49 n.11, p.1255-1271, November 2000
Sally A. McKee, Robert H. Klenke, Kenneth L. Wright, William A. Wulf, Maximo H. Salinas, James H. Aylor, Alan P. Batson, Smarter Memory: Improving Bandwidth for Streamed References, Computer, v.31 n.7, p.54-63, July 1998
Jarek Nieplocha, Bruce Palmer, Vinod Tipparaju, Manojkumar Krishnan, Harold Trease, Edoardo Aprà, Advances, Applications and Performance of the Global Arrays Shared Memory Programming Toolkit, International Journal of High Performance Computing Applications, v.20 n.2, p.203-231, May 2006
Jeff Bilmes, Krste Asanovic, Chee-Whye Chin, Jim Demmel, Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology, Proceedings of the 11th international conference on Supercomputing, p.340-347, July 07-11, 1997, Vienna, Austria
Sandhya Krishnan, Sriram Krishnamoorthy, Gerald Baumgartner, Chi-Chung Lam, J. Ramanujam, P. Sadayappan, Venkatesh Choppella, Efficient synthesis of out-of-core algorithms using a nonlinear optimization solver, Journal of Parallel and Distributed Computing, v.66 n.5, p.659-673, May 2006
Il-Chul Yoon, Alan Sussman, Atif Memon, Adam Porter, Effective and scalable software compatibility testing, Proceedings of the 2008 international symposium on Software testing and analysis, July 20-24, 2008, Seattle, WA, USA
Leonardo Bachega, Siddhartha Chatterjee, Kenneth A. Dockser, John A. Gunnels, Manish Gupta, Fred G. Gustavson, Christopher A. Lapkowski, Gary K. Liu, Mark P. Mendell, Charles D. Wait, T. J. Chris Ward, A High-Performance SIMD Floating Point Unit for BlueGene/L: Architecture, Compilation, and Algorithm Design, Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques, p.85-96, September 29-October 03, 2004
Thierry Joffrain, Tze Meng Low, Enrique S. Quintana-Ortí, Robert van de Geijn, Field G. Van Zee, Accumulating Householder transformations, revisited, ACM Transactions on Mathematical Software (TOMS), v.32 n.2, p.169-179, June 2006
Bryan S. Morse, Terry S. Yoo, Penny Rheingans, David T. Chen, K. R. Subramanian, Interpolating implicit surfaces from scattered surface data using compactly supported radial basis functions, ACM SIGGRAPH 2005 Courses, July 31-August 04, 2005, Los Angeles, California
Jeffrey R. Diamond, Behnam Robatmili, Stephen W. Keckler, Robert van de Geijn, Kazushige Goto, Doug Burger, High performance dense linear algebra on a spatially distributed processor, Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming, February 20-23, 2008, Salt Lake City, UT, USA
Ernie Chan, Enrique S. Quintana-Orti, Gregorio Quintana-Orti, Robert van de Geijn, Supermatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures, Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures, June 09-11, 2007, San Diego, California, USA
Ernie Chan, Field G. Van Zee, Paolo Bientinesi, Enrique S. Quintana-Orti, Gregorio Quintana-Orti, Robert van de Geijn, SuperMatrix: a multithreaded runtime scheduling system for algorithms-by-blocks, Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming, February 20-23, 2008, Salt Lake City, UT, USA
S.R. Tiyyagura, P. Adamidis, R. Rabenseifner, P. Lammers, S. Borowski, F. Lippold, F. Svensson, O. Marxen, S. Haberhauer, A.P. Seitsonen, J. Furthmüller, K. Benkert, M. Galle, T. Bönisch, U. Küster, M.M. Resch, Teraflops Sustained Performance With Real World Applications, International Journal of High Performance Computing Applications, v.22 n.2, p.131-148, May 2008
Enrique Arias, Angelines Alberto, Jesus Montesinos, Tomas Rojo, Fernando Cuartero, Jesus Benet, A mathematical model of the static pantograph/catenary interaction, International Journal of Computer Mathematics, v.86 n.2, p.333-340, February 2009
Subhash Saini, Robert Ciotti, Brian T. N. Gunney, Thomas E. Spelce, Alice Koniges, Don Dossa, Panagiotis Adamidis, Rolf Rabenseifner, Sunil R. Tiyyagura, Matthias Mueller, Performance evaluation of supercomputers using HPCC and IMB Benchmarks, Journal of Computer and System Sciences, v.74 n.6, p.965-982, September 2008
Michael Kistler, John Gunnels, Daniel Brokenshire, Brad Benton, Petascale computing with accelerators, Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming, February 14-18, 2009, Raleigh, NC, USA
Jaeyoung Choi, Jack J. Dongarra, L. Susan Ostrouchov, Antoine P. Petitet, David W. Walker, R. Clint Whaley, Design and Implementation of the ScaLAPACK LU, QR, and Cholesky Factorization Routines, Scientific Programming, v.5 n.3, p.173-184, August 1996
Gregorio Quintana-Ortí, Francisco D. Igual, Enrique S. Quintana-Ortí, Robert A. van de Geijn, Solving dense linear systems on platforms with multiple hardware accelerators, Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming, February 14-18, 2009, Raleigh, NC, USA
REVIEW
Reviewer: Chaya Gurwitz
The original set of FORTRAN basic linear algebra subprograms, or
Level 1 BLAS, included vector operations [1]; the routines in Level 2
BLAS were later added to provide matrix-vector operations [2]. This
paper proposes adding a set of Level 3 B