Anatomy of high-performance matrix multiplication
Full text: PDF (776 KB)
Source
ACM Transactions on Mathematical Software (TOMS)
Volume 34, Issue 3 (May 2008)
Article No. 12
Year of Publication: 2008
ISSN: 0098-3500
Authors
Kazushige Goto  The University of Texas at Austin, Austin, TX
Robert A. van de Geijn  The University of Texas at Austin, Austin, TX
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 76, Downloads (12 Months): 441, Citation Count: 12
DOI: http://doi.acm.org/10.1145/1356052.1356053

ABSTRACT

We present the basic principles that underlie the high-performance implementation of the matrix-matrix multiplication that is part of the widely used GotoBLAS library. Design decisions are justified by successively refining a model of architectures with multilevel memories. A simple but effective algorithm for executing this operation results. Implementations on a broad selection of architectures are shown to achieve near-peak performance.
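
The abstract ties the design to a model of multilevel memories. As a rough illustration of that general idea only, the sketch below multiplies matrices tile by tile so that each small working set is reused before the loops move on. It is a generic textbook-style blocked loop nest, not the GotoBLAS algorithm described in the paper; the function names `blocked_matmul` and `naive_matmul` and the `block` parameter are hypothetical, and the inputs are assumed to be non-empty lists of equal-length rows.

```python
def naive_matmul(a, b):
    """Reference triple loop: C[i][j] = sum over p of A[i][p] * B[p][j]."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def blocked_matmul(a, b, block=2):
    """Compute C = A * B one tile at a time (illustrative sketch only).

    `block` is a hypothetical tile size. In a tuned library it would be
    chosen from the cache sizes of the target machine; here it is just
    a small number so the tiling is easy to follow.
    """
    m, k = len(a), len(a[0])
    if len(b) != k:
        raise ValueError("inner dimensions do not match")
    n = len(b[0])
    c = [[0.0] * n for _ in range(m)]

    # Outer loops walk over tiles of A (i0, p0) and B (p0, j0).
    for i0 in range(0, m, block):
        for p0 in range(0, k, block):
            for j0 in range(0, n, block):
                # Inner loops update one tile of C from one tile of A and B.
                # min() handles edge tiles when a dimension is not a
                # multiple of the tile size.
                for i in range(i0, min(i0 + block, m)):
                    row_c = c[i]
                    for p in range(p0, min(p0 + block, k)):
                        a_ip = a[i][p]
                        row_b = b[p]
                        for j in range(j0, min(j0 + block, n)):
                            row_c[j] += a_ip * row_b[j]
    return c
```

Both functions perform the same arithmetic; only the order in which elements are visited differs, which is what a memory-hierarchy model is meant to guide. In pure Python the tiling brings no speedup, so the sketch shows the loop structure rather than the performance effect.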


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
8. Goto, K. 2005. www.tacc.utexas.edu/resources/software/.
9. Goto, K. and van de Geijn, R. 2006. High-performance implementation of the level-3 BLAS. FLAME Working Note #20, Tech. rep. TR-2006-23, Department of Computer Sciences, The University of Texas at Austin.
10. Goto, K. and van de Geijn, R. A. 2002. On reducing TLB misses in matrix multiplication. Tech. rep. CS-TR-02-55, Department of Computer Sciences, The University of Texas at Austin.
11. Gunnels, J., Gustavson, F., Henry, G., and van de Geijn, R. A. 2005. A family of high-performance matrix multiplication algorithms. Lecture Notes in Computer Science, vol. 3732. Springer-Verlag, 256--265.
17. Strazdins, P. E. 1998. Transporting distributed BLAS to the Fujitsu AP3000 and VPP-300. In Proceedings of the 8th Parallel Computing Workshop (PCW'98). 69--76.
19. Whaley, R. C., Petitet, A., and Dongarra, J. J. 2001. Automated empirical optimization of software and the ATLAS project. Parall. Comput. 27, 1--2, 3--35.


REVIEW

Reviewer: James Harold Davenport

"How hard can it be to implement matrix multiply?" One can imagine an irascible reader asking that of this 25-page paper, but it is the wrong question. This paper addresses the implementation of matrix multiplication efficiently, on a range of mod...
