ABSTRACT
In this paper, we address two problems: automatically selecting unroll factors for perfectly nested loops, and generating compact code for the selected unroll factors. Compared to past work, our contributions include a) a more detailed cost model that includes ILP and I-cache considerations, b) a new code generation algorithm for unrolling nested loops that generates more compact code (with fewer remainder loops) than the unroll-and-jam transformation, and c) a new algorithm for efficiently enumerating feasible unroll vectors. Our experimental results confirm the wide applicability of our approach by showing a 2.2X speedup on matrix multiply, and an average 1.08X speedup on seven of the SPEC95fp benchmarks (with a 1.2X speedup for two benchmarks). These speedups are significant because the baseline compiler used for comparison is the IBM XL Fortran product compiler, which generates high-quality code with unrolling and software pipelining of innermost loops enabled. Larger performance improvements from unrolling nested loops can be expected on processors that have more registers and higher degrees of instruction-level parallelism than the processor used for our measurements (PowerPC 604).