The RISC BLAS: a blocked implementation of level 3 BLAS for RISC processors
Source: ACM Transactions on Mathematical Software (TOMS)
Volume 25, Issue 3 (September 1999)
Pages: 316-340
Year of Publication: 1999
ISSN: 0098-3500
Authors
Michel J. Daydé  ENSEEIHT-IRIT, Toulouse, France
Iain S. Duff  CERFACS and Rutherford Appleton Lab., Oxon, England
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/326147.326150

ABSTRACT

We describe a version of the Level 3 BLAS which is designed to be efficient on RISC processors. This is an extension of previous studies by the authors and colleagues on a similar approach for efficient serial and parallel implementations on virtual-memory and shared-memory multiprocessors. All our codes are written in Fortran and use loop-unrolling, blocking, and copying to improve the performance. A blocking technique is used to express the BLAS in terms of operations involving triangular blocks and calls to the matrix-matrix multiplication kernel (GEMM). No manufacturer-supplied or assembler code is used. This blocked implementation uses the same blocking ideas as in our implementation for vector machines except that the ordering of loops is designed for efficient reuse of data held in cache and not necessarily for parallelization. All the codes are specifically tuned for RISC processors. The software also includes a tuned version of GEMM. A parameter which controls the blocking allows efficient exploitation of the memory hierarchy on the various target computers. We present results on a range of RISC-based workstations and multiprocessors: CRAY T3D, DEC 8400 5/300, HP 715/64, IBM SP2, MEIKO CS2-HA, SGI Power Challenge 10000, and SUN UltraSPARC-1 model 140. This implementation of the Level 3 BLAS is available on anonymous FTP, and we welcome input from users to improve and extend our BLAS implementation.
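
The blocking idea in the abstract can be illustrated with a small sketch. It is not the authors' code: their software is written in Fortran, and the function names `gemm_blocked` and `trsm_lower_blocked`, the list-of-lists storage, and the default value of the block-size parameter `nb` are assumptions made purely for illustration. The sketch shows a triangular solve (the TRSM operation of the Level 3 BLAS) expressed as small solves on triangular diagonal blocks plus GEMM updates of the remaining rows, with a block of one operand copied into a local buffer before it is reused.

```python
def gemm_blocked(A, B, C, nb=2):
    """Compute C += A * B in place, sweeping over nb-by-nb blocks.

    A is m x k, B is k x n, C is m x n (lists of lists).
    nb is a hypothetical block-size parameter.
    """
    m, k, n = len(A), len(B), len(B[0])
    for jj in range(0, n, nb):
        for ll in range(0, k, nb):
            # Copy the current block of B into a small local buffer so it
            # can be reused for every block row of A.
            b_blk = [row[jj:jj + nb] for row in B[ll:ll + nb]]
            for ii in range(0, m, nb):
                for i in range(ii, min(ii + nb, m)):
                    a_row, c_row = A[i], C[i]
                    for dl, b_row in enumerate(b_blk):
                        a = a_row[ll + dl]
                        for dj, b in enumerate(b_row):
                            c_row[jj + dj] += a * b
    return C


def trsm_lower_blocked(L, B, nb=2):
    """Solve L * X = B for X, overwriting B, with L lower triangular.

    Each step solves with one triangular diagonal block, then updates
    the rows below it through a call to gemm_blocked.
    """
    n, p = len(L), len(B[0])
    for kk in range(0, n, nb):
        kend = min(kk + nb, n)
        # Unblocked forward substitution on the diagonal block.
        for i in range(kk, kend):
            for j in range(p):
                s = B[i][j]
                for l in range(kk, i):
                    s -= L[i][l] * B[l][j]
                B[i][j] = s / L[i][i]
        # GEMM update: B[kend:, :] -= L[kend:, kk:kend] * X[kk:kend, :].
        if kend < n:
            neg_panel = [[-x for x in row[kk:kend]] for row in L[kend:]]
            # B[kend:] holds the same row objects as B, so the update
            # performed by gemm_blocked lands in B itself.
            gemm_blocked(neg_panel, B[kk:kend], B[kend:], nb)
    return B
```

Here `nb` plays the role of the blocking parameter mentioned in the abstract: a larger value gives more reuse per copied block, while a smaller one keeps the working set small. The appropriate value on a given machine depends on its cache sizes, which this sketch does not model.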


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
 
2
AMESTOY, P. R. AND DAYDÉ, M. J. 1993. Tuned block implementation of Level 3 BLAS for the CONVEX C220 and RISC processors. Distributed implementation of LU factorization using PVM. Tech. Rep. RT/APO/93/3 Toulouse, France.
 
3
AMESTOY, P. R. AND DUFF, I. S. 1989. Vectorization of a multiprocessor multifrontal code. Int. J. Supercomput. Appl. High Perform. Eng. 3, 3, 41-59.
 
4
AMESTOY, P. R., DAYDÉ, M. J., DUFF, I. S., AND MORÈRE, P. 1995. Linear algebra calculations on a virtual shared memory computer. Int. J. High Speed Comput. 7, 1, 21-43.
 
5
 
6
BELL, R. 1991. IBM RISC System/6000 NIC tuning guide for Fortran and C. Tech. Rep. GG24-3611-01, IBM International Technical Support Centers.
 
7
BERGER, P. AND DAYDÉ, M. J. 1991. Implementation and use of Level 3 BLAS kernels on a Transputer T800 ring network. Tech. Rep. TR/PA/91/54, CERFACS.
 
8
BODIN, F. AND SEZNEC, A. 1994. Cache organization influence on loop blocking. Tech. Rep. 803, IRISA, Rennes, France.
 
9
DAYDÉ, M. J. AND DUFF, I. S. 1989. Use of Level 3 BLAS in LU factorization on the CRAY-2, the ETA 10-P, and the IBM 3090 VF. Int. J. Supercomput. Appl. High Perform. Eng. 3, 2, 40-70.
 
10
DAYDÉ, M. J. AND DUFF, I. S. 1995. Porting industrial codes and developing sparse linear solvers on parallel computers. Comput. Syst. Eng. 4, 5, 295-305.
 
11
DAYDÉ, M. J. AND DUFF, I. S. 1996. A blocked implementation of Level 3 BLAS for RISC processors. Tech. Rep. RAL-TR-96-014. Rutherford Appleton Lab., Didcot, Oxon, United Kingdom. Also ENSEEIHT-IRIT Tech. Rep. RT/APO/96/1 and CERFACS Rep. TR/PA/96/06.
 
12
DAYDÉ, M. J. AND DUFF, I. S. 1997. A block implementation of Level 3 BLAS for RISC processors, revised version. Tech. Rep. RT/APO/97/2, ENSEEIHT-IRIT.
13
14
15
 
16
 
17
 
18
GALLIVAN, K., JALBY, W., MEIER, U., AND SAMEH, A. H. 1988. Impact of hierarchical memory systems on linear algebra algorithm design. Int. J. Supercomput. Appl. High Perform. Eng. 2, 1, 12-48.
 
19
20
21
 
22
 
23
PUGLISI, C. 1993. QR Factorization of large sparse overdetermined and square matrices using the multifrontal method in a multiprocessor environment. Ph.D. Dissertation.
 
24
QRICHI ANIBA, A. 1994. Implémentation performante du BLAS de niveau 3 pour les processeurs RISC. Tech. Rep. Rapport 3ème Année. Département Informatique et Mathématiques Appliquées, ENSEEIHT, Toulouse, France.
 
25
SHEIKH, Q. AND LIU, J. 1989. Basic linear algebra subprogram optimization on the CRAY-2 system. CRAY Channels Spring.
