ACM Home Page
Please provide us with feedback. Feedback
Optimized unrolling of nested loops
Full text PdfPdf (1.10 MB)
Source International Conference on Supercomputing archive
Proceedings of the 14th international conference on Supercomputing table of contents
Santa Fe, New Mexico, United States
Pages: 153 - 166  
Year of Publication: 2000
ISBN:1-58113-270-0
Author
Vivek Sarkar  IBM Research, Thomas J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY
Sponsor
SIGARCH: ACM Special Interest Group on Computer Architecture
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 10,   Downloads (12 Months): 92,   Citation Count: 6
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/335231.335246
What is a DOI?

ABSTRACT

In this paper, we address the problems of automatically selecting unroll factors for perfectly nested loops, and generating compact code for the selected unroll factors. Compared to past work, the contributions of our work include a) a more detailed cost model that includes ILP and 1-cache considerations, b) a new code generation algorithm for unrolling nested loops that generates more compact code (with fewer remainder loops) than the unroll-and-jam transformation, and c) a new algorithm for efficiently enumerating feasible unroll vectors.Our experimental results confirm the wide applicability of our approach by showing a 2.2X speedup on matrix multiply, and an average 1.08X speedup on seven of the SPEC95fp benchmarks (with a 1.2X speedup for two benchmarks). These speedups are significant because the baseline compiler used for comparison is the IBM XL Fortran product compiler which generates high quality code with unrolling and software pipelining of innermost loops enabled. Larger performance improvements due to unrolling of nested loops can be expected on processors that have larger numbers of registers and larger degrees of instruction-level parallelism than the processor used for our measurements (PowerPC 604).


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
Michael J. Alexander, Mark W. Barley, Bruce R.. Childers, Jack W. Davidson, and Sanjay Jinturkar. Memory bandwidth optimizations for wide-bus machines. Proceedings of the ~fith Hawaii International Conference on System Sciences, Wailea, Hawaii, pages 466-475, January 1993.
 
2
F. E. Allen and J. Cocke. A catalogue of optimizing transformations. In Design and Optimization of Compilers, pages 1-30. Prentice-Hall, 1972.
3
 
4
Mauricio Breternitz, Michael Lai, Vivek Sarkar, and Barbara Simons. Compiler Solutions for the Stale-Data and False-Sharing Problems. Technical report, IBM Santa Teresa Laboratory, April 1993. TR 03.466.
5
 
6
7
 
8
 
9
The Standard Performance Evaluation Corporation. SPEC CPU95 Benchmarks. http://open.specbench.org/osg/cpu95/, 1997.
 
10
 
11
J. J. Dongarra and A. R. Hinds. Unrolling Loops in Fortran. Software - Practice and Exper/ence, 9(3):219- 226, March 1979.
 
12
13
14
 
15
 
16
 
17
18
 
19
 
20
 
21
Vivek Sarkar and Barbara Simons. Don't Waste Those Cycles: An In-Depth Look at Scheduling Instructions in Basic Blocks and Loops. Video Lecture in University Video Communication's Distinguished Lecture Series IX, August 1994.
22
23
24
25
 
26