|
ABSTRACT
Strassen's matrix multiplication (MM) has benefits with respect to any (highly tuned) implementations of MM because Strassen's reduces the total number of operations. Strassen achieved this operation reduction by replacing computationally expensive MMs with matrix additions (MAs). For architectures with simple memory hierarchies, having fewer operations directly translates into an efficient utilization of the CPU and, thus, faster execution. However, for modern architectures with complex memory hierarchies, the operations introduced by the MAs have a limited in-cache data reuse and thus poor memory-hierarchy utilization, thereby overshadowing the (improved) CPU utilization, and making Strassen's algorithm (largely) useless on its own. In this paper, we investigate the interaction between Strassen's effective performance and the memory-hierarchy organization. We show how to exploit Strassen's full potential across different architectures. We present an easy-to-use adaptive algorithm that combines a novel implementation of Strassen's idea with the MM from automatically tuned linear algebra software (ATLAS) or GotoBLAS. An additional advantage of our algorithm is that it applies to any size and shape matrices and works equally well with row or column major layout. Our implementation consists of introducing a final step in the ATLAS/GotoBLAS-installation process that estimates whether or not we can achieve any additional speedup using our Strassen's adaptation algorithm. Then we install our codes, validate our estimates, and determine the specific performance. We show that, by the right combination of Strassen's with ATLAS/GotoBLAS, our approach achieves up to 30%/22% speed-up versus ATLAS/GotoBLAS alone on modern high-performance single processors. We consider and present the complexity and the numerical analysis of our algorithm, and, finally, we show performance for 17 (uniprocessor) systems.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
|
| |
3
|
|
| |
4
|
|
| |
5
|
R. P. Brent. Error analysis of algorithms for matrix multiplication and triangular decomposition using Winograd's identity. Numerische Mathematik, 16:145--156, 1970.
|
 |
6
|
Siddhartha Chatterjee , Alvin R. Lebeck , Praveen K. Patnala , Mithuna Thottethodi, Recursive array layouts and fast parallel matrix multiplication, Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures, p.222-231, June 27-30, 1999, Saint Malo, France
[doi> 10.1145/305619.305645]
|
| |
7
|
H. Cohn, R. Kleinberg, B. Szegedy, and C. Umans. Group-theoretic algorithms for matrix multiplication, Nov 2005.
|
 |
8
|
|
| |
9
|
|
| |
10
|
P. D'Alberto and A. Nicolau. Using recursion to boost ATLAS's performance. In The Sixth International Symposium on High Performance Computing (ISHPC-VI), 2005.
|
| |
11
|
J. Demmel, J. Dongarra, E. Eijkhout, E. Fuentes, E. Petitet, V. Vuduc, R. C. Whaley, and K. Yelick. Self-Adapting linear algebra algorithms and software. Proceedings of the IEEE, special issue on "Program Generation, Optimization, and Adaptation", 93(2), 2005.
|
| |
12
|
J. Demmel, J. Dumitriu, O. Holtz, and R. Kleinberg. Fast matrix multiplication is stable, Mar 2006.
|
 |
13
|
|
| |
14
|
N. Eiron, M. Rodeh, and I. Steinwarts. Matrix multiplication: a case study of algorithm engineering. In Proceedings WAE'98, Saarbrücken, Germany, Aug 1998.
|
 |
15
|
|
| |
16
|
M. Frigo and S. Johnson. The design and implementation of FFTW3. Proceedings of the IEEE, special issue on "Program Generation, Optimization, and Adaptation", 93(2):216--231, 2005.
|
| |
17
|
|
| |
18
|
K. Goto and R. A. van de Geijn. Anatomy of high-performance matrix multiplication. ACM Transactions on Mathematical Software.
|
 |
19
|
|
 |
20
|
|
| |
21
|
|
| |
22
|
Steven Huss-Lederman , Elaine M. Jacobson , Anna Tsao , Thomas Turnbull , Jeremy R. Johnson, Implementation of Strassen's algorithm for matrix multiplication, Proceedings of the 1996 ACM/IEEE conference on Supercomputing (CDROM), p.32-es, January 01-01, 1996, Pittsburgh, Pennsylvania, United States
[doi> 10.1145/369028.369096]
|
| |
23
|
I. Kaporin. A practical algorithm for faster matrix multiplication. Numerical Linear Algebra with Applications, 6(8):687--700, 1999. Centre for Supercomputer and Massively Parallel Applications, Computing Centre of the Russian Academy of Sciences, Vavilova 40, Moscow 117967, Russia.
|
| |
24
|
|
| |
25
|
P. Knight. Fast rectangular matrix multiplication and QR-decomposition. Linear algebra and its applications, 221:69--81, 1995.
|
| |
26
|
|
| |
27
|
V. Pan. Strassen's algorithm is not optimal: Trililnear technique of aggregating, uniting and canceling for constructing fast algorithms for matrix operations. In FOCS, pages 166--176, 1978.
|
| |
28
|
|
| |
29
|
D. Priest. Algorithms for arbitrary precision floating point arithmetic. In P. Kornerup and D. W. Matula, editors, Proceedings of the 10th IEEE Symposium on Computer Arithmetic (Arith-10), pages 132--144, Grenoble, France, 1991. IEEE Computer Society Press, Los Alamitos, CA.
|
| |
30
|
M. Püschel, J. M. F. Moura, J. Johnson, D. Padua, M. Veloso, B. W. Singer, J. Xiong, F. Franchetti, A. Gačić, Y. Voronenko, K. Chen, R. W. Johnson, and N. Rizzolo. SPIRAL: Code generation for DSP transforms. Proceedings of the IEEE, special issue on "Program Generation, Optimization, and Adaptation", 93(2), 2005.
|
| |
31
|
V. Strassen. Gaussian elimination is not optimal. Numerische Mathematik, 14(3):354--356, 1969.
|
| |
32
|
|
| |
33
|
|
REVIEW
"James Harold Davenport : Reviewer"
This paper studies how Strassen’s matrix multiplication, a divide-and-conquer approach, can be used in practice on real machines with today’s architectures. The authors are essentially only considering single processors. Though some du
more...
|