ACM Home Page
Please provide us with feedback. Feedback
Analyzing block locality in Morton-order and Morton-hybrid matrices
Full text PdfPdf (277 KB)
Source Memory Performance: Dealing With Applications, Systems And Architecture; Vol. 260 archive
Proceedings of the 2006 workshop on MEmory performance: DEaling with Applications, systems and architectures table of contents
Seattle, Washington
Pages: 5 - 12  
Year of Publication: 2006
ISBN:1-59593-568-1
Authors
K. Patrick Lorton  Indiana University, Bloomington, IN
David S. Wise  Indiana University, Bloomington, IN
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 3,   Downloads (12 Months): 23,   Citation Count: 2
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1166133.1166134
What is a DOI?

ABSTRACT

As the architectures of computers change, introducing more caches onto multicore chips, even more locality becomes necessary. With the bandwidth between caches and RAM now even more valuable, additional locality from new matrix representations will be important to keep multiple processors busy. The default storage representations of both C and FORTRAN, row- and column-major respectively, have fundamental deficiencies with many matrix computations. By switching the storage representation from cartesian to block indices, one is able to take better advantage of cache locality at all levels from LI to paging. This paper only changes storage representation from row-major to Morton-hybrid, and applies it to matrix multiplication. Its purpose is to show that, even with only traditional iterative algorithms, simply changing storage representation offers significant speedups.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
 
2
M. Bader and C. Zenger. Cache oblivious matrix multiplication using an element ordering based on the Peano curve. In Parallel Processing and Applied Mathematics, volume 3911 of Lecture Notes in Comput. Sci., pages 1042--1049, Berlin, 2006.Springer. http://dx.doi.org/10.1007/11752578_126
 
3
4
 
5
G. C. Fox. A graphical approach to load balancing and sparse matrix-vector multiplication. In M. Schultz, editor, Numerical Algorithms for Modern Parallel Architectures, volume 13 of IMA Vol. in Math. & Appl., pages 37--61. Springer, New York, 1988.
6
 
7
 
8
S. T. Gabriel, B. Chenoweth, K. P. Lorton, M. Carlson, and D. S. Wise. The Opie Compiler Distribution. Indiana University, Bloomington, IN, Apr. 2005. http://www.cs.indina.edu/~dswise/Opie/distribution.html
9
10
 
11
 
12
K. Goto and R. van de Geijn. On reducing TLB misses in matrix multiplication. FLAME Working Note 9, Univ. of Texas, Austin, Nov. 2002. http://www.cs.utexas.edu/users/flame/pubs/GOTO.ps.gz
 
13
K. Goto and R. A. van de Geijn. Anatomy of high-performance matrix multiplication. Technical report, Univ. of Texas, Austin. Submitted for publication. Visited Sept. 2006. http://www.cs.utexas.edu/users/flame/pubs/GOTO_TOMS.pdf
 
14
Innovative Computing Laboratory, Univ. of Tennessee, Knoxville, TN. Performance Application Programming Interface (PAPI), Dec. 2005. http://icl.cs.utk.edu/papi/
 
15
D. S. Johnson. A theoretician's guide to the experimental analysis of algorithms. In M. H. Goldwasser, D. S. Johnson, and C. C. McGeoch, editors, Data Structures, Near Neighbor Searches, and Methodology: 5th & 6th DIMACS Implementation Challenges, volume 59 of DIMACS Ser. Discrete Math. Theoret. Comput. Sci., pages 215--250. Amer. Math. Soc., Providence, 2002. http://www.research.att.com/~dsj/papers.html
 
16
 
17
J. Markoff. Writing the fastest code, by hand, for fun: A human computer keeps speeding up chips. The New York Times, CLV (53,412):Cl, C6, 2005 Nov. 28. http://www.nytimes.com/2005/11/28/technology/28super.html
 
18
G. M. Morton. A computer oriented geodetic data base and a new technique in file sequencing. Technical report, IBM Ltd., Ottawa, Ontario, Mar. 1966.
 
19
 
20
 
21
 
22
 
23
V. Valsalam and A. Skjellum. A framework for high-performance matrix multiplication based on hierarchical abstractions, algorithms and optimized low-level kernels. Concur. Comp. Prac. Exper., 14(10):805--839, 2002. http://dx.doi.org/10.1002/cpe.630
 
24
 
25
D. S. Wise, C. L. Citro, J. J. Hursey, F. Liu, and M. A. Rainey. A paradigm for parallel matrix algorithms: Scalable Cholesky. In J. C. Cunha and P. D. Medeiros, editors, Euro-Par 2005 - Parallel Processing, number 3648 in Lecture Notes in Comput. Sci., pages 687--698. Springer, Berlin, Aug. 2005. http://dx.doi.org/10.1107/11549468_76
26


Collaborative Colleagues:
K. Patrick Lorton: colleagues
David S. Wise: colleagues