ACM Home Page
Please provide us with feedback. Feedback
Performance optimizations and bounds for sparse matrix-vector multiply
Full text PdfPdf (867 KB)
Source Conference on High Performance Networking and Computing archive
Proceedings of the 2002 ACM/IEEE conference on Supercomputing table of contents
Baltimore, Maryland
Pages: 1 - 35  
Year of Publication: 2002
Authors
Richard Vuduc  University of California, Berkeley, Berkeley, California
James W. Demmel  University of California, Berkeley, Berkeley, California
Katherine A. Yelick  University of California, Berkeley, Berkeley, California
Shoaib Kamil  University of California, Berkeley, Berkeley, California
Rajesh Nishtala  University of California, Berkeley, Berkeley, California
Benjamin Lee  University of California, Berkeley, Berkeley, California
Sponsors
IEEE-CS\DATC : IEEE Computer Society
ACM: Association for Computing Machinery
SIGARCH: ACM Special Interest Group on Computer Architecture
Publisher
IEEE Computer Society Press  Los Alamitos, CA, USA
Bibliometrics
Downloads (6 Weeks): 12,   Downloads (12 Months): 58,   Citation Count: 10
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Review this Article  

ABSTRACT

We consider performance tuning, by code and data structure reorganization, of sparse matrix-vector multiply (SpM × V), one of the most important computational kernels in scientific applications. This paper addresses the fundamental questions of what limits exist on such performance tuning, and how closely tuned code approaches these limits.Specifically, we develop upper and lower bounds on the performance (Mflop/s) of SpM × V when tuned using our previously proposed register blocking optimization. These bounds are based on the non-zero pattern in the matrix and the cost of basic memory operations, such as cache hits and misses. We evaluate our tuned implementations with respect to these bounds using hardware counter data on 4 different platforms and on a test set of 44 sparse matrices. We find that we can often get within 20% of the upper bound, particularly on a class of matrices from finite element modeling (FEM) problems; on non-FEM matrices, performance improvements of 2× are still possible. Lastly, we present a new heuristic that selects optimal or near-optimal register block sizes (the key tuning parameters) more accurately than our previous heuristic. Using the new heuristic, we show improvements in SpM × V performance (Mflop/s) by as much as 2.5× over an untuned implementation.Collectively, our results suggest that future performance improvements, beyond those that we have already demonstrated for SpM × V, will come from two sources: (1) consideration of higher-level matrix structures (e.g., exploiting symmetry, matrix reordering, multiple register block sizes), and (2) optimizing kernels with more opportunity for data reuse (e.g., sparse matrix-multiple vector multiply, multiplication of ATA by a vector).


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. V. der Vorst. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 2nd Edition. SIAM, Philadelphia, PA, 1994.
 
2
3
 
4
S. Blackford, G. Corliss, J. Demmel, J. Dongarra, I. Duff, S. Hammarling, G. Henry, M. Heroux, C. Hu, W. Kahan, L. Kaufman, B. Kearfott, F. Krogh, X. Li, Z. Maany, A. Petitet, R. Pozo, K. Remington, W. Walster, C. Whaley, and J. W. von Gudenberg. Document for the Basic Linear Algebra Subprograms (BLAS) standard: BLAS Technical Forum, 2001. www.netlib.org/blast.
 
5
 
6
 
7
8
 
9
T. Davis. UF Sparse Matrix Collection. www.cise.ufl.edu/research/sparse/matrices.
 
10
B. B. Fraguela, R. Doallo, and E. L. Zapata. Memory hierarchy performance prediction for sparse blocked algorithms. Parallel Processing Letters, 9(3), March 1999.
11
 
12
W. D. Gropp, D. K. Kasushik, D. E. Keyes, and B. F. Smith. Towards realistic bounds for implicit CFD codes. In Proceedings of Parallel Computational Fluid Dynamics, pages 241--248, 1999.
 
13
G. Heber, A. J. Dolgert, M. Alt, K. A. Mazurkiewicz, and L. Stringer. Fracture mechanics on the Intel Itanium architecture: A case study. In Workshop on EPIC Architectures and Compiler Technology (ACM MICRO 34), Austin, TX, December 2001.
 
14
G. M. Henry. Flexible, high-performance matrix multiply via a self-modifying runtime code. Technical Report TR-2001-46, University of Texas at Austin, December 2001.
 
15
 
16
 
17
 
18
Intel. Intel Itanium processor reference manual for software optimization, November 2001.
 
19
 
20
J. D. McCalpin. Memory bandwidth and machine balance in current high performance computers. Newsletter of the IEEE Technical Committee on Computer Architecture, December 1995. http://tab.computer.org/tcca/NEWS/DEC95/DEC95.HTM.
 
21
J. D. McCalpin. STREAM: Measuring sustainable memory bandwidth in high performance computers, 1995. http://www.cs.virginia.edu/stream.
22
23
24
 
25
 
26
K. Remington and R. Pozo. NIST Sparse BLAS: User's Guide. Technical report, NIST, 1996. gams.nist.gov/spblas.
 
27
 
28
 
29
 
30
S. Toledo. Improving memory-system performance of sparse matrix-vector multiplication. In Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing, March 1997.
 
31
32

CITED BY  10

Collaborative Colleagues:
Richard Vuduc: colleagues
James W. Demmel: colleagues
Katherine A. Yelick: colleagues
Shoaib Kamil: colleagues
Rajesh Nishtala: colleagues
Benjamin Lee: colleagues