ACM Home Page
Please provide us with feedback. Feedback
Adaptive Winograd's matrix multiplications
Full text PdfPdf (910 KB)
Source
ACM Transactions on Mathematical Software (TOMS) archive
Volume 36 ,  Issue 1  (March 2009) table of contents
Article No. 3  
Year of Publication: 2009
ISSN:0098-3500
Authors
Paolo D'Alberto  Yahoo! Inc., Santa Clara, CA
Alexandru Nicolau  University of California, Irvine, Irvine, CA
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 19,   Downloads (12 Months): 169,   Citation Count: 0
Additional Information:

abstract   references   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1486525.1486528
What is a DOI?

ABSTRACT

Modern architectures have complex memory hierarchies and increasing parallelism (e.g., multicores). These features make achieving and maintaining good performance across rapidly changing architectures increasingly difficult. Performance has become a complex tradeoff, not just a simple matter of counting cost of simple CPU operations.

We present a novel, hybrid, and adaptive recursive Strassen-Winograd's matrix multiplication (MM) that uses automatically tuned linear algebra software (ATLAS) or GotoBLAS. Our algorithm applies to any size and shape matrices stored in either row or column major layout (in double precision in this work) and thus is efficiently applicable to both C and FORTRAN implementations. In addition, our algorithm divides the computation into equivalent in-complexity sub-MMs and does not require any extra computation to combine the intermediary sub-MM results.

We achieve up to 22% execution-time reduction versus GotoBLAS/ATLAS alone for a single core system and up to 19% for a two dual-core processor system. Most importantly, even for small matrices such as 1500 × 1500, our approach attains already 10% execution-time reduction and, for MM of matrices larger than 3000× 3000, it delivers performance that would correspond, for a classic O(n3) algorithm, to faster-than-processor peak performance (i.e., our algorithm delivers the equivalent of 5 GFLOPS performance on a system with 4.4 GFLOPS peak performance and where GotoBLAS achieves only 4 GFLOPS). This is a result of the savings in operations (and thus FLOPS). Therefore, our algorithm is faster than any classic MM algorithms could ever be for matrices of this size. Furthermore, we present experimental evidence based on established methodologies found in the literature that our algorithm is, for a family of matrices, as accurate as the classic algorithms.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
 
2
3
4
 
5
 
6
Brent, R. P. 1970b. Error analysis of algorithms for matrix multiplication and triangular decomposition using Winograd's identity. Numer. Math. 16, 145--156.
 
7
 
8
9
 
10
 
11
D'Alberto, P. and Nicolau, A. 2005b. Using recursion to boost ATLAS's performance. In Proceedings of the Sixth International Symposium on High Performance Computing (ISHPC-VI).
12
 
13
Demmel, J., Dongarra, J., Eijkhout, E., Fuentes, E., Petitet, E., Vuduc, V., Whaley, R., and Yelick, K. 2005. Self-Adapting linear algebra algorithms and software. In Proceedings of the IEEE, Special Issue on “Program Generation, Optimization, and Adaptation.” vol. 93. 2.
 
14
15
16
17
 
18
 
19
Eiron, N., Rodeh, M., and Steinwarts, I. 1998. Matrix multiplication: A case study of algorithm engineering. In Proceedings of WAE'98 (Saarbrücken, Germany).
20
 
21
Frigo, M. and Johnson, S. 2005. The design and implementation of FFTW3. In Proc. IEEE (Special issue on “Program Generation, Optimization, and Adaptation”) 93, 2, 216--231.
22
 
23
24
25
 
26
 
27
Huss-Lederman, S., Jacobson, E., Johnson, J., Tsao, A., and Turnbull, T. 1996a. Strassen's algorithm for matrix multiplication: Modeling, analysis, and implementation. Tech. rep. CCS-TR-96-14. Center for Computing Sciences, University of Wisconsin-Madison, Madison, WI.
 
28
29
30
 
31
Kaporin, I. 1999. A practical algorithm for faster matrix multiplication. Numer. Lin. Alg. Appl. 6, 8, 687--700. Centre for Supercomputer and Massively Parallel Applications, Computing Centre of the Russian Academy of Sciences, Vavilova 40, Moscow 117967, Russia.
 
32
33
 
34
 
35
 
36
Ohtaki, Y., Takahashi, D., Boku, T., and Sato, M. 2004. Parallel implementation of Strassen's matrix multiplication algorithm for heterogeneous clusters. In Proceedings of the 18th International Parallel and Distributed Processing Symposium. 112. http://doi.ieeecomputersociety.org/10.1109/IPDPS.2004.1303066.
 
37
 
38
 
39
 
40
Priest, D. 1991. Algorithms for arbitrary precision floating point arithmetic. In Proceedings of the 10th IEEE Symposium on Computer Arithmetic (Arith-10), P. Kornerup and D. W. Matula, Eds. IEEE Computer Society Press, Los Alamitos, CA, 132--144.
 
41
Püschel, M., Moura, J., Johnson, J., Padua, D., Veloso, M., Singer, B., Xiong, J., Franchetti, F., Gačić, A., Voronenko, Y., Chen, K., Johnson, R., and Rizzolo, N. 2005. SPIRAL: Code generation for DSP transforms. Proc. IEEE (Special issue on “Program Generation, Optimization, and Adaptation”) 93, 2, 232--275.
 
42
Strassen, V. 1969. Gaussian elimination is not optimal. Numer. Math. 14, 3, 354--356.
 
43
 
44
 
45

Collaborative Colleagues:
Paolo D'Alberto: colleagues
Alexandru Nicolau: colleagues