ABSTRACT
We present the basic principles that underlie the high-performance implementation of the matrix-matrix multiplication that is part of the widely used GotoBLAS library. Design decisions are justified by successively refining a model of architectures with multilevel memories. A simple but effective algorithm for executing this operation results. Implementations on a broad selection of architectures are shown to achieve near-peak performance.
REFERENCES
2. E. Anderson, Z. Bai, C. Bischof, L. S. Blackford, J. Demmel, Jack J. Dongarra, J. Du Croz, S. Hammarling, A. Greenbaum, A. McKenney, D. Sorensen, LAPACK Users' Guide (third ed.), Society for Industrial and Applied Mathematics, Philadelphia, PA, 1999.
3. Leonardo Bachega, Siddhartha Chatterjee, Kenneth A. Dockser, John A. Gunnels, Manish Gupta, Fred G. Gustavson, Christopher A. Lapkowski, Gary K. Liu, Mark P. Mendell, Charles D. Wait, T. J. Chris Ward, A High-Performance SIMD Floating Point Unit for BlueGene/L: Architecture, Compilation, and Algorithm Design, Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques, p. 85-96, September 29-October 03, 2004. doi: 10.1109/PACT.2004.2
8. Goto, K. 2005. www.tacc.utexas.edu/resources/software/.
9. Goto, K. and van de Geijn, R. 2006. High-performance implementation of the level-3 BLAS. FLAME Working Note #20, Tech. rep. TR-2006-23, Department of Computer Sciences, The University of Texas at Austin.
10. Goto, K. and van de Geijn, R. A. 2002. On reducing TLB misses in matrix multiplication. Tech. rep. CS-TR-02-55, Department of Computer Sciences, The University of Texas at Austin.
11. Gunnels, J., Gustavson, F., Henry, G., and van de Geijn, R. A. 2005. A family of high-performance matrix multiplication algorithms. Lecture Notes in Computer Science, vol. 3732. Springer-Verlag, 256--265.
17. Strazdins, P. E. 1998. Transporting distributed BLAS to the Fujitsu AP3000 and VPP-300. In Proceedings of the 8th Parallel Computing Workshop (PCW'98). 69--76.
19. Whaley, R. C., Petitet, A., and Dongarra, J. J. 2001. Automated empirical optimization of software and the ATLAS project. Parall. Comput. 27, 1--2, 3--35.
CITED BY 12
Michael Kistler, John Gunnels, Daniel Brokenshire, Brad Benton, Petascale computing with accelerators, Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming, February 14-18, 2009, Raleigh, NC, USA
Mark Gebhart, Bertrand A. Maher, Katherine E. Coons, Jeff Diamond, Paul Gratz, Mario Marino, Nitya Ranganathan, Behnam Robatmili, Aaron Smith, James Burrill, Stephen W. Keckler, Doug Burger, Kathryn S. McKinley, An evaluation of the TRIPS computer system, ACM SIGPLAN Notices, v.44 n.3, March 2009
REVIEW
Reviewer: James Harold Davenport
"How hard can it be to implement matrix multiply?" One can imagine an irascible reader asking that of this 25-page paper, but it is the wrong question. This paper addresses the implementation of matrix multiplication efficiently, on a range of mod