|
ABSTRACT
Caches have become increasingly important with the widening gap between main memory and processor speeds. Small and fast cache memories are designed to bridge this discrepancy. However, they are only effective when programs exhibit sufficient data locality.The performance of the memory hierarchy can be improved by means of data and loop transformations. Tiling is a loop transformation that aims at reducing capacity misses by shortening the reuse distance. Padding is a data layout transformation targeted to reduce conflict misses.This article presents an accurate cost model that describes misses across different hierarchy levels and considers the effects of other hardware components such as branch predictors. The cost model drives the application of tiling and padding transformations. We combine the cost model with a genetic algorithm to compute the tile and pad factors that enhance the program performance.To validate our strategy, we ran experiments for a set of benchmarks on a large set of modern architectures. Our results show that this scheme is useful to optimize programs' performance. When compared to previous approaches, we observe that with a reasonable compile-time overhead, our approach gives significant performance improvements for all studied kernels on all architectures.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
Belady, L. A. 1966. A study of replacement algorithms for a virtual-storage computer. IBM Syst. J.
|
| |
3
|
Bermudo, N., Vera, X., González, A., and Llosa, J. 2000. An efficient solver for cache miss equations. In Proceedings of IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS'00). IEEE Computer Society Press, Los Alamitos, CA.
|
 |
4
|
Michael Butler , Tse-Yu Yeh , Yale Patt , Mitch Alsup , Hunter Scales , Michael Shebanow, Single instruction stream parallelism is greater than two, Proceedings of the 18th annual international symposium on Computer architecture, p.276-286, May 27-30, 1991, Toronto, Ontario, Canada
|
| |
5
|
|
 |
6
|
Steve Carr , Kathryn S. McKinley , Chau-Wen Tseng, Compiler optimizations for improving data locality, Proceedings of the sixth international conference on Architectural support for programming languages and operating systems, p.252-262, October 05-07, 1994, San Jose, California, United States
|
 |
7
|
Siddhartha Chatterjee , Vibhor V. Jain , Alvin R. Lebeck , Shyam Mundhra , Mithuna Thottethodi, Nonlinear array layouts for hierarchical memory systems, Proceedings of the 13th international conference on Supercomputing, p.444-453, June 20-25, 1999, Rhodes, Greece
[doi> 10.1145/305138.305231]
|
 |
8
|
|
 |
9
|
|
| |
10
|
|
| |
11
|
|
| |
12
|
Fernández, A. 1999. A quantitative analysis of the SPECfp95. Tech. Rep. UPC-DAC-1999-12, Universitat Politècnica de Catalunya. March.
|
| |
13
|
|
 |
14
|
|
| |
15
|
Gill, P. E., Murray, W., and Wright, M. H. 1981. Practical optimization. Academic Press, Orlando, FL.
|
| |
16
|
|
| |
17
|
|
| |
18
|
Hansen, P., Jaumard, B., and Mathon, V. 1995. Constrained nonlinear 0-1 programming. ORSA J. Comput.
|
| |
19
|
|
| |
20
|
Horst, R., Pardalos, P. M., and Thoai, N. V. 1995. Introduction to Global Optimization. Kluwer Academic Publishers.
|
| |
21
|
|
| |
22
|
Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. 1983. Optimization by simulated annealing. Science 220.
|
 |
23
|
Monica D. Lam , Edward E. Rothberg , Michael E. Wolf, The cache performance and optimizations of blocked algorithms, Proceedings of the fourth international conference on Architectural support for programming languages and operating systems, p.63-74, April 08-11, 1991, Santa Clara, California, United States
|
| |
24
|
McFarling, S. 1993. Combining branch predictors. Tech. Rep. TN-36, Digital Western Researh Lab.
|
 |
25
|
|
 |
26
|
|
| |
27
|
|
 |
28
|
Todd C. Mowry , Monica S. Lam , Anoop Gupta, Design and evaluation of a compiler algorithm for prefetching, Proceedings of the fifth international conference on Architectural support for programming languages and operating systems, p.62-73, October 12-15, 1992, Boston, Massachusetts, United States
|
 |
29
|
|
 |
30
|
|
 |
31
|
|
| |
32
|
Rivera, G. and Tseng, C.-W. 1999a. A comparison of compiler tiling algorithms. In Proceedings of the 8th International Conference on Compiler Construction (CC'99).
|
 |
33
|
|
| |
34
|
|
 |
35
|
|
| |
36
|
|
| |
37
|
|
 |
38
|
|
| |
39
|
|
 |
40
|
|
| |
41
|
Wolfe, M. 1996. Advanced loop interchanging. In Proceedings of International Conference on Parallel Processing (ICPP'96).
|
REVIEW
"Mitesh R Meswani : Reviewer"
Today, well-tuned systems are able to achieve 30 percent peak central processing unit (CPU) utilization. The memory subsystem continues to be the bottleneck for achieving peak performance. The increase in the gap between speedup curves of CPU and
more...
|