|
ABSTRACT
Modern architectures have complex memory hierarchies and increasing parallelism (e.g., multicores). These features make achieving and maintaining good performance across rapidly changing architectures increasingly difficult. Performance has become a complex tradeoff, not just a simple matter of counting cost of simple CPU operations. We present a novel, hybrid, and adaptive recursive Strassen-Winograd's matrix multiplication (MM) that uses automatically tuned linear algebra software (ATLAS) or GotoBLAS. Our algorithm applies to any size and shape matrices stored in either row or column major layout (in double precision in this work) and thus is efficiently applicable to both C and FORTRAN implementations. In addition, our algorithm divides the computation into equivalent in-complexity sub-MMs and does not require any extra computation to combine the intermediary sub-MM results. We achieve up to 22% execution-time reduction versus GotoBLAS/ATLAS alone for a single core system and up to 19% for a two dual-core processor system. Most importantly, even for small matrices such as 1500 × 1500, our approach attains already 10% execution-time reduction and, for MM of matrices larger than 3000× 3000, it delivers performance that would correspond, for a classic O(n3) algorithm, to faster-than-processor peak performance (i.e., our algorithm delivers the equivalent of 5 GFLOPS performance on a system with 4.4 GFLOPS peak performance and where GotoBLAS achieves only 4 GFLOPS). This is a result of the savings in operations (and thus FLOPS). Therefore, our algorithm is faster than any classic MM algorithms could ever be for matrices of this size. Furthermore, we present experimental evidence based on established methodologies found in the literature that our algorithm is, for a family of matrices, as accurate as the classic algorithms.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
E. Anderson , Z. Bai , C. Bischof , L. S. Blackford , J. Demmel , Jack J. Dongarra , J. Du Croz , S. Hammarling , A. Greenbaum , A. McKenney , D. Sorensen, LAPACK Users' guide (third ed.), Society for Industrial and Applied Mathematics, Philadelphia, PA, 1999
|
| |
2
|
|
 |
3
|
Jeff Bilmes , Krste Asanovic , Chee-Whye Chin , Jim Demmel, Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology, Proceedings of the 11th international conference on Supercomputing, p.340-347, July 07-11, 1997, Vienna, Austria
[doi> 10.1145/263580.263662]
|
 |
4
|
|
| |
5
|
|
| |
6
|
Brent, R. P. 1970b. Error analysis of algorithms for matrix multiplication and triangular decomposition using Winograd's identity. Numer. Math. 16, 145--156.
|
| |
7
|
|
| |
8
|
|
 |
9
|
|
| |
10
|
|
| |
11
|
D'Alberto, P. and Nicolau, A. 2005b. Using recursion to boost ATLAS's performance. In Proceedings of the Sixth International Symposium on High Performance Computing (ISHPC-VI).
|
 |
12
|
|
| |
13
|
Demmel, J., Dongarra, J., Eijkhout, E., Fuentes, E., Petitet, E., Vuduc, V., Whaley, R., and Yelick, K. 2005. Self-Adapting linear algebra algorithms and software. In Proceedings of the IEEE, Special Issue on “Program Generation, Optimization, and Adaptation.” vol. 93. 2.
|
| |
14
|
|
 |
15
|
|
 |
16
|
|
 |
17
|
|
| |
18
|
|
| |
19
|
Eiron, N., Rodeh, M., and Steinwarts, I. 1998. Matrix multiplication: A case study of algorithm engineering. In Proceedings of WAE'98 (Saarbrücken, Germany).
|
 |
20
|
|
| |
21
|
Frigo, M. and Johnson, S. 2005. The design and implementation of FFTW3. In Proc. IEEE (Special issue on “Program Generation, Optimization, and Adaptation”) 93, 2, 216--231.
|
 |
22
|
|
| |
23
|
|
 |
24
|
|
 |
25
|
|
| |
26
|
|
| |
27
|
Huss-Lederman, S., Jacobson, E., Johnson, J., Tsao, A., and Turnbull, T. 1996a. Strassen's algorithm for matrix multiplication: Modeling, analysis, and implementation. Tech. rep. CCS-TR-96-14. Center for Computing Sciences, University of Wisconsin-Madison, Madison, WI.
|
| |
28
|
Steven Huss-Lederman , Elaine M. Jacobson , Anna Tsao , Thomas Turnbull , Jeremy R. Johnson, Implementation of Strassen's algorithm for matrix multiplication, Proceedings of the 1996 ACM/IEEE conference on Supercomputing (CDROM), p.32-es, January 01-01, 1996, Pittsburgh, Pennsylvania, United States
[doi> 10.1145/369028.369096]
|
 |
29
|
|
 |
30
|
|
| |
31
|
Kaporin, I. 1999. A practical algorithm for faster matrix multiplication. Numer. Lin. Alg. Appl. 6, 8, 687--700. Centre for Supercomputer and Massively Parallel Applications, Computing Centre of the Russian Academy of Sciences, Vavilova 40, Moscow 117967, Russia.
|
| |
32
|
|
 |
33
|
|
| |
34
|
|
| |
35
|
Duc Kien Nguyen , Ivan Lavallee , Marc Bui , Quoc Trung Ha, A General Scalable Implementation of Fast Matrix Multiplication Algorithms on Distributed Memory Computers, Proceedings of the Sixth International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing and First ACIS International Workshop on Self-Assembling Wireless Networks, p.116-122, May 23-25, 2005
[doi> 10.1109/SNPD-SAWN.2005.2]
|
| |
36
|
Ohtaki, Y., Takahashi, D., Boku, T., and Sato, M. 2004. Parallel implementation of Strassen's matrix multiplication algorithm for heterogeneous clusters. In Proceedings of the 18th International Parallel and Distributed Processing Symposium. 112. http://doi.ieeecomputersociety.org/10.1109/IPDPS.2004.1303066.
|
| |
37
|
|
| |
38
|
|
| |
39
|
|
| |
40
|
Priest, D. 1991. Algorithms for arbitrary precision floating point arithmetic. In Proceedings of the 10th IEEE Symposium on Computer Arithmetic (Arith-10), P. Kornerup and D. W. Matula, Eds. IEEE Computer Society Press, Los Alamitos, CA, 132--144.
|
| |
41
|
Püschel, M., Moura, J., Johnson, J., Padua, D., Veloso, M., Singer, B., Xiong, J., Franchetti, F., Gačić, A., Voronenko, Y., Chen, K., Johnson, R., and Rizzolo, N. 2005. SPIRAL: Code generation for DSP transforms. Proc. IEEE (Special issue on “Program Generation, Optimization, and Adaptation”) 93, 2, 232--275.
|
| |
42
|
Strassen, V. 1969. Gaussian elimination is not optimal. Numer. Math. 14, 3, 354--356.
|
| |
43
|
|
| |
44
|
|
| |
45
|
|
|