|
ABSTRACT
As the architectures of computers change, introducing more caches onto multicore chips, even more locality becomes necessary. With the bandwidth between caches and RAM now even more valuable, additional locality from new matrix representations will be important to keep multiple processors busy. The default storage representations of both C and Fortran, row- and column-major respectively, have fundamental deficiencies with many matrix computations. By switching the storage representation from cartesian to block indices, one is able to take better advantage of cache locality at all levels from L1 to paging. This paper only changes storage representation from row-major to Morton-hybrid, and applies it to matrix multiplication. Its purpose is to show that, even with only traditional iterative algorithms, simply changing storage representation offers significant speedups.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
 |
2
|
|
| |
3
|
Bader, M., and Zenger, C. Cache oblivious matrix multiplication using an element ordering based on the Peano curve. In Parallel Processing and Applied Mathematics (Berlin, 2006), vol. 3911 of Lecture Notes in Comput. Sci., Springer, pp. 1042--1049. http://dx.doi.org/10.1007/11752578_126
|
| |
4
|
|
 |
5
|
|
| |
6
|
Fox, G. C. A graphical approach to load balancing and sparse matrix-vector multiplication. In Numerical Algorithms for Modern Parallel Architectures, M. Schultz, Ed., vol. 13 of IMA Vol. in Math. & Appl. Springer, New York, 1988, pp. 37--61.
|
 |
7
|
Basilio B. Fraguela , Jia Guo , Ganesh Bikshandi , María J. Garzarán , Gheorghe Almási , José Moreira , David Padua, The Hierarchically Tiled Arrays programming approach, Proceedings of the 7th workshop on Workshop on languages, compilers, and run-time support for scalable systems, p.1-12, October 22-23, 2004, Houston, Texas
[doi> 10.1145/1066650.1066657]
|
| |
8
|
|
| |
9
|
Gabriel, S. T., Chenoweth, B., Lorton, K. P., Carlson, M., and Wise, D. S. The Opie Compiler Distribution. Indiana University, Bloomington, IN, Apr. 2005. http://www/cs.indiana.edu/~dswise/Opie/distribution.html
|
 |
10
|
|
 |
11
|
|
| |
12
|
|
| |
13
|
Goto, K., and van de Geijn, R. On reducing TLB misses in matrix multiplication. FLAME Working Note 9, Univ. of Texas, Austin, Nov. 2002. http://www.cs.utexas.edu/users/flame/pubs/GOTO.ps.gz
|
| |
14
|
Goto, K., and van de Geijn, R. A. Anatomy of high-performance matrix multiplication. Tech. rep., Univ. of Texas, Austin. Submittted for publication. Visited Sept. 2006. http://www.cs.utexas.edu/users/flame/pubs/GOTO_TOMS.pdf
|
| |
15
|
Innovative Computing Laboratory, Univ. of Tennessee. Performance Application Programming Interface (PAPI). Knoxville, TN, Dec. 2005. http://icl.cs.utk.edu/papi/
|
| |
16
|
Johnson, D. S. A theoretician's guide to the experimental analysis of algorithms. In Data Structures, Near Neighbor Searches, and Methodology: 5th & 6th DIMACS Implementation Challenges, M. H. Goldwasser, D. S. Johnson, and C. C. McGeoch, Eds., vol. 59 of DIMACS Ser. Discrete Math. Theoret. Comput. Sci. Amer. Math. Soc, Providence, 2002, pp. 215--250. http://www.research.att.com/~dsj/papers.html
|
| |
17
|
|
| |
18
|
Markoff, J. Writing the fastest code, by hand, for fun: A human computer keeps speeding up chips. The New York Times CLV, 53, 412 (2005 Nov. 28), C1, C6. http://www.nytimes.com/2005/11/28/technology/28super.html
|
| |
19
|
Morton, G. M. A computer oriented geodetic data base and a new technique in file sequencing. Tech. rep., IBM Ltd., Ottawa, Ontario, Mar. 1966.
|
| |
20
|
|
| |
21
|
|
| |
22
|
|
| |
23
|
|
| |
24
|
Valsalam, V., and Skjellum, A. A framework for high-performance matrix multiplication based on hierarchical abstractions, algorithms and optimized low-level kernels. Concur. Comp. Prac. Exper. 14, 10 (2002), 805--839. http://dx.doi.org/10.1002/cpe.630
|
| |
25
|
|
| |
26
|
Wise, D. S., Citro, C. L., Hursey, J. J., Liu, F., and Rainey, M. A. A paradigm for parallel matrix algorithms: Scalable Cholesky. In Euro-Par 2005 --- Parallel Processing, J. C. Cunha and P. D. Medeiros, Eds., no. 3648 in Lecture Notes in Comput. Sci. Springer, Berlin, Aug. 2005, pp. 687--698. http://dx.doi.org/10.1007/11549468_76
|
 |
27
|
|
CITED BY
|
|
Aydin Buluç , Jeremy T. Fineman , Matteo Frigo , John R. Gilbert , Charles E. Leiserson, Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks, Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures, August 11-13, 2009, Calgary, AB, Canada
|
|