Combining analytical and empirical approaches in tuning matrix transposition
Source: PACT
Proceedings of the 15th International Conference on Parallel Architectures and Compilation Techniques
Seattle, Washington, USA
Session: Application-specific optimizations
Pages: 233 - 242  
Year of Publication: 2006
ISBN: 1-59593-264-X
Authors
Qingda Lu, The Ohio State University, Columbus, OH
Sriram Krishnamoorthy, The Ohio State University, Columbus, OH
P. Sadayappan, The Ohio State University, Columbus, OH
Sponsor
ACM: Association for Computing Machinery
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1152154.1152190

ABSTRACT

Matrix transposition is an important kernel used in many applications. Even though its optimization has been the subject of many studies, an optimization procedure that targets the characteristics of current processor architectures has not been developed. In this paper, we develop an integrated optimization framework that addresses a number of issues, including tiling for the memory hierarchy, effective handling of memory misalignment, utilizing memory subsystem characteristics, and the exploitation of the parallelism provided by the vector instruction sets in current processors. A judicious combination of analytical and empirical approaches is used to determine the most appropriate optimizations. The absence of problem information until execution time is handled by generating multiple versions of the code; the best version is chosen at runtime, with assistance from minimal-overhead inspectors. The approach highlights aspects of empirical optimization that are important for similar computations with little temporal reuse. Experimental results on PowerPC G5 and Intel Pentium 4 demonstrate the effectiveness of the developed framework.
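
To make two of the ideas in the abstract concrete, here is a minimal sketch of tiling combined with runtime version selection. It is not the authors' framework. The function names, the tile size of 4, and the selection rule in `choose_version` are all illustrative assumptions, and Python is used only for readability (the paper targets vectorized native code).

```python
def transpose_naive(a, rows, cols):
    """Transpose a row-major rows x cols matrix stored as a flat list."""
    out = [0] * (rows * cols)
    for i in range(rows):
        for j in range(cols):
            out[j * rows + i] = a[i * cols + j]
    return out


def transpose_tiled(a, rows, cols, tile=4):
    """Same result as transpose_naive, but visits the matrix tile by tile.

    Working on one tile at a time keeps the source rows and destination
    rows of that tile close together in the memory hierarchy. The tile
    size here is an arbitrary assumption, not a tuned value.
    """
    out = [0] * (rows * cols)
    for ii in range(0, rows, tile):
        for jj in range(0, cols, tile):
            # min() handles partial tiles at the matrix edges.
            for i in range(ii, min(ii + tile, rows)):
                for j in range(jj, min(jj + tile, cols)):
                    out[j * rows + i] = a[i * cols + j]
    return out


def choose_version(rows, cols, tile=4):
    """Hypothetical inspector: a cheap runtime check that picks a version.

    The rule (fall back to the naive loop when the matrix is smaller
    than one tile) is an assumption made for illustration only.
    """
    if rows < tile or cols < tile:
        return transpose_naive
    return lambda a, r, c: transpose_tiled(a, r, c, tile)


def transpose(a, rows, cols):
    """Dispatch to whichever version the inspector selects."""
    return choose_version(rows, cols)(a, rows, cols)
```

For example, `transpose(list(range(6)), 2, 3)` treats the input as the 2 x 3 matrix with rows `[0, 1, 2]` and `[3, 4, 5]` and returns the flat 3 x 2 result. In a real implementation the inspector would examine properties such as alignment and problem size, and the candidate versions would be generated and tuned ahead of time.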

