| Performance optimizations and bounds for sparse matrix-vector multiply |
| Full text |
Pdf
(867 KB)
|
| Source
|
Conference on High Performance Networking and Computing
archive
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
table of contents
Baltimore, Maryland
Pages: 1 - 35
Year of Publication: 2002
|
|
Authors
|
|
Richard Vuduc
|
University of California, Berkeley, Berkeley, California
|
|
James W. Demmel
|
University of California, Berkeley, Berkeley, California
|
|
Katherine A. Yelick
|
University of California, Berkeley, Berkeley, California
|
|
Shoaib Kamil
|
University of California, Berkeley, Berkeley, California
|
|
Rajesh Nishtala
|
University of California, Berkeley, Berkeley, California
|
|
Benjamin Lee
|
University of California, Berkeley, Berkeley, California
|
|
| Sponsors |
|
| Publisher |
IEEE Computer Society Press
Los Alamitos, CA, USA
|
| Bibliometrics |
Downloads (6 Weeks): 12, Downloads (12 Months): 58, Citation Count: 10
|
|
|
ABSTRACT
We consider performance tuning, by code and data structure reorganization, of sparse matrix-vector multiply (SpM × V), one of the most important computational kernels in scientific applications. This paper addresses the fundamental questions of what limits exist on such performance tuning, and how closely tuned code approaches these limits.Specifically, we develop upper and lower bounds on the performance (Mflop/s) of SpM × V when tuned using our previously proposed register blocking optimization. These bounds are based on the non-zero pattern in the matrix and the cost of basic memory operations, such as cache hits and misses. We evaluate our tuned implementations with respect to these bounds using hardware counter data on 4 different platforms and on a test set of 44 sparse matrices. We find that we can often get within 20% of the upper bound, particularly on a class of matrices from finite element modeling (FEM) problems; on non-FEM matrices, performance improvements of 2× are still possible. Lastly, we present a new heuristic that selects optimal or near-optimal register block sizes (the key tuning parameters) more accurately than our previous heuristic. Using the new heuristic, we show improvements in SpM × V performance (Mflop/s) by as much as 2.5× over an untuned implementation.Collectively, our results suggest that future performance improvements, beyond those that we have already demonstrated for SpM × V, will come from two sources: (1) consideration of higher-level matrix structures (e.g., exploiting symmetry, matrix reordering, multiple register block sizes), and (2) optimizing kernels with more opportunity for data reuse (e.g., sparse matrix-multiple vector multiply, multiplication of ATA by a vector).
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. V. der Vorst. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 2nd Edition. SIAM, Philadelphia, PA, 1994.
|
| |
2
|
|
 |
3
|
Jeff Bilmes , Krste Asanovic , Chee-Whye Chin , Jim Demmel, Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology, Proceedings of the 11th international conference on Supercomputing, p.340-347, July 07-11, 1997, Vienna, Austria
[doi> 10.1145/263580.263662]
|
| |
4
|
S. Blackford, G. Corliss, J. Demmel, J. Dongarra, I. Duff, S. Hammarling, G. Henry, M. Heroux, C. Hu, W. Kahan, L. Kaufman, B. Kearfott, F. Krogh, X. Li, Z. Maany, A. Petitet, R. Pozo, K. Remington, W. Walster, C. Whaley, and J. W. von Gudenberg. Document for the Basic Linear Algebra Subprograms (BLAS) standard: BLAS Technical Forum, 2001. www.netlib.org/blast.
|
| |
5
|
Ronald F. Boisvert , Roldan Pozo , Karin Remington , Richard F. Barrett , Jack J. Dongarra, Matrix market: a web resource for test matrix collections, Proceedings of the IFIP TC2/WG2.5 working conference on Quality of numerical software: assessment and enhancement, p.125-137, January 1997, Oxford, United Kingdom
|
| |
6
|
S. Browne , J. Dongarra , N. Garner , K. London , P. Mucci, A scalable cross-platform infrastructure for application performance tuning using hardware counters, Proceedings of the 2000 ACM/IEEE conference on Supercomputing (CDROM), p.42-es, November 04-10, 2000, Dallas, Texas, United States
|
| |
7
|
|
 |
8
|
Siddhartha Chatterjee , Erin Parker , Philip J. Hanlon , Alvin R. Lebeck, Exact analysis of the cache behavior of nested loops, Proceedings of the ACM SIGPLAN 2001 conference on Programming language design and implementation, p.286-297, June 2001, Snowbird, Utah, United States
|
| |
9
|
T. Davis. UF Sparse Matrix Collection. www.cise.ufl.edu/research/sparse/matrices.
|
| |
10
|
B. B. Fraguela, R. Doallo, and E. L. Zapata. Memory hierarchy performance prediction for sparse blocked algorithms. Parallel Processing Letters, 9(3), March 1999.
|
 |
11
|
|
| |
12
|
W. D. Gropp, D. K. Kasushik, D. E. Keyes, and B. F. Smith. Towards realistic bounds for implicit CFD codes. In Proceedings of Parallel Computational Fluid Dynamics, pages 241--248, 1999.
|
| |
13
|
G. Heber, A. J. Dolgert, M. Alt, K. A. Mazurkiewicz, and L. Stringer. Fracture mechanics on the Intel Itanium architecture: A case study. In Workshop on EPIC Architectures and Compiler Technology (ACM MICRO 34), Austin, TX, December 2001.
|
| |
14
|
G. M. Henry. Flexible, high-performance matrix multiply via a self-modifying runtime code. Technical Report TR-2001-46, University of Texas at Austin, December 2001.
|
| |
15
|
|
| |
16
|
|
| |
17
|
|
| |
18
|
Intel. Intel Itanium processor reference manual for software optimization, November 2001.
|
| |
19
|
|
| |
20
|
J. D. McCalpin. Memory bandwidth and machine balance in current high performance computers. Newsletter of the IEEE Technical Committee on Computer Architecture, December 1995. http://tab.computer.org/tcca/NEWS/DEC95/DEC95.HTM.
|
| |
21
|
J. D. McCalpin. STREAM: Measuring sustainable memory bandwidth in high performance computers, 1995. http://www.cs.virginia.edu/stream.
|
 |
22
|
|
 |
23
|
Juan J. Navarro , Elena García-Diego , Josep-L. Larriba-Pey , Toni Juan, Block algorithms for sparse matrix computations on high performance workstations, Proceedings of the 10th international conference on Supercomputing, p.301-308, May 25-28, 1996, Philadelphia, Pennsylvania, United States
[doi> 10.1145/237578.237624]
|
 |
24
|
|
| |
25
|
|
| |
26
|
K. Remington and R. Pozo. NIST Sparse BLAS: User's Guide. Technical report, NIST, 1996. gams.nist.gov/spblas.
|
| |
27
|
|
| |
28
|
|
| |
29
|
|
| |
30
|
S. Toledo. Improving memory-system performance of sparse matrix-vector multiplication. In Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing, March 1997.
|
| |
31
|
|
 |
32
|
|
|