|
ABSTRACT
As the architectures of computers change, introducing more caches onto multicore chips, even more locality becomes necessary. With the bandwidth between caches and RAM now even more valuable, additional locality from new matrix representations will be important to keep multiple processors busy. The default storage representations of both C and FORTRAN, row- and column-major respectively, have fundamental deficiencies with many matrix computations. By switching the storage representation from cartesian to block indices, one is able to take better advantage of cache locality at all levels from LI to paging. This paper only changes storage representation from row-major to Morton-hybrid, and applies it to matrix multiplication. Its purpose is to show that, even with only traditional iterative algorithms, simply changing storage representation offers significant speedups.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
| |
2
|
M. Bader and C. Zenger. Cache oblivious matrix multiplication using an element ordering based on the Peano curve. In Parallel Processing and Applied Mathematics, volume 3911 of Lecture Notes in Comput. Sci., pages 1042--1049, Berlin, 2006.Springer. http://dx.doi.org/10.1007/11752578_126
|
| |
3
|
|
 |
4
|
|
| |
5
|
G. C. Fox. A graphical approach to load balancing and sparse matrix-vector multiplication. In M. Schultz, editor, Numerical Algorithms for Modern Parallel Architectures, volume 13 of IMA Vol. in Math. & Appl., pages 37--61. Springer, New York, 1988.
|
 |
6
|
Basilio B. Fraguela , Jia Guo , Ganesh Bikshandi , María J. Garzarán , Gheorghe Almási , José Moreira , David Padua, The Hierarchically Tiled Arrays programming approach, Proceedings of the 7th workshop on Workshop on languages, compilers, and run-time support for scalable systems, p.1-12, October 22-23, 2004, Houston, Texas
[doi> 10.1145/1066650.1066657]
|
| |
7
|
|
| |
8
|
S. T. Gabriel, B. Chenoweth, K. P. Lorton, M. Carlson, and D. S. Wise. The Opie Compiler Distribution. Indiana University, Bloomington, IN, Apr. 2005. http://www.cs.indina.edu/~dswise/Opie/distribution.html
|
 |
9
|
|
 |
10
|
|
| |
11
|
|
| |
12
|
K. Goto and R. van de Geijn. On reducing TLB misses in matrix multiplication. FLAME Working Note 9, Univ. of Texas, Austin, Nov. 2002. http://www.cs.utexas.edu/users/flame/pubs/GOTO.ps.gz
|
| |
13
|
K. Goto and R. A. van de Geijn. Anatomy of high-performance matrix multiplication. Technical report, Univ. of Texas, Austin. Submitted for publication. Visited Sept. 2006. http://www.cs.utexas.edu/users/flame/pubs/GOTO_TOMS.pdf
|
| |
14
|
Innovative Computing Laboratory, Univ. of Tennessee, Knoxville, TN. Performance Application Programming Interface (PAPI), Dec. 2005. http://icl.cs.utk.edu/papi/
|
| |
15
|
D. S. Johnson. A theoretician's guide to the experimental analysis of algorithms. In M. H. Goldwasser, D. S. Johnson, and C. C. McGeoch, editors, Data Structures, Near Neighbor Searches, and Methodology: 5th & 6th DIMACS Implementation Challenges, volume 59 of DIMACS Ser. Discrete Math. Theoret. Comput. Sci., pages 215--250. Amer. Math. Soc., Providence, 2002. http://www.research.att.com/~dsj/papers.html
|
| |
16
|
|
| |
17
|
J. Markoff. Writing the fastest code, by hand, for fun: A human computer keeps speeding up chips. The New York Times, CLV (53,412):Cl, C6, 2005 Nov. 28. http://www.nytimes.com/2005/11/28/technology/28super.html
|
| |
18
|
G. M. Morton. A computer oriented geodetic data base and a new technique in file sequencing. Technical report, IBM Ltd., Ottawa, Ontario, Mar. 1966.
|
| |
19
|
|
| |
20
|
|
| |
21
|
|
| |
22
|
|
| |
23
|
V. Valsalam and A. Skjellum. A framework for high-performance matrix multiplication based on hierarchical abstractions, algorithms and optimized low-level kernels. Concur. Comp. Prac. Exper., 14(10):805--839, 2002. http://dx.doi.org/10.1002/cpe.630
|
| |
24
|
|
| |
25
|
D. S. Wise, C. L. Citro, J. J. Hursey, F. Liu, and M. A. Rainey. A paradigm for parallel matrix algorithms: Scalable Cholesky. In J. C. Cunha and P. D. Medeiros, editors, Euro-Par 2005 - Parallel Processing, number 3648 in Lecture Notes in Comput. Sci., pages 687--698. Springer, Berlin, Aug. 2005. http://dx.doi.org/10.1107/11549468_76
|
 |
26
|
|
|