ACM Home Page
Please provide us with feedback. Feedback
An accurate cost model for guiding data locality transformations
Full text PdfPdf (2.50 MB)
Source ACM Transactions on Programming Languages and Systems (TOPLAS) archive
Volume 27 ,  Issue 5  (September 2005) table of contents
Pages: 946 - 987  
Year of Publication: 2005
ISSN:0164-0925
Authors
Xavier Vera  Mälardalens Högskola, Västerås, Sweden
Jaume Abella  Universitat Politècnica de Catalunya-Barcelona, Barcelona, Spain
Josep Llosa  Universitat Politècnica de Catalunya-Barcelona, Barcelona, Spain
Antonio González  Universitat Politècnica de Catalunya-Barcelona, Barcelona, Spain
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 5,   Downloads (12 Months): 54,   Citation Count: 0
Additional Information:

abstract   references   index terms   review   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1086642.1086646
What is a DOI?

ABSTRACT

Caches have become increasingly important with the widening gap between main memory and processor speeds. Small and fast cache memories are designed to bridge this discrepancy. However, they are only effective when programs exhibit sufficient data locality.The performance of the memory hierarchy can be improved by means of data and loop transformations. Tiling is a loop transformation that aims at reducing capacity misses by shortening the reuse distance. Padding is a data layout transformation targeted to reduce conflict misses.This article presents an accurate cost model that describes misses across different hierarchy levels and considers the effects of other hardware components such as branch predictors. The cost model drives the application of tiling and padding transformations. We combine the cost model with a genetic algorithm to compute the tile and pad factors that enhance the program performance.To validate our strategy, we ran experiments for a set of benchmarks on a large set of modern architectures. Our results show that this scheme is useful to optimize programs' performance. When compared to previous approaches, we observe that with a reasonable compile-time overhead, our approach gives significant performance improvements for all studied kernels on all architectures.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
 
2
Belady, L. A. 1966. A study of replacement algorithms for a virtual-storage computer. IBM Syst. J.
 
3
Bermudo, N., Vera, X., González, A., and Llosa, J. 2000. An efficient solver for cache miss equations. In Proceedings of IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS'00). IEEE Computer Society Press, Los Alamitos, CA.
4
 
5
6
7
8
9
 
10
 
11
 
12
Fernández, A. 1999. A quantitative analysis of the SPECfp95. Tech. Rep. UPC-DAC-1999-12, Universitat Politècnica de Catalunya. March.
 
13
14
 
15
Gill, P. E., Murray, W., and Wright, M. H. 1981. Practical optimization. Academic Press, Orlando, FL.
 
16
 
17
 
18
Hansen, P., Jaumard, B., and Mathon, V. 1995. Constrained nonlinear 0-1 programming. ORSA J. Comput.
 
19
 
20
Horst, R., Pardalos, P. M., and Thoai, N. V. 1995. Introduction to Global Optimization. Kluwer Academic Publishers.
 
21
 
22
Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. 1983. Optimization by simulated annealing. Science 220.
23
 
24
McFarling, S. 1993. Combining branch predictors. Tech. Rep. TN-36, Digital Western Researh Lab.
25
26
 
27
28
29
30
31
 
32
Rivera, G. and Tseng, C.-W. 1999a. A comparison of compiler tiling algorithms. In Proceedings of the 8th International Conference on Compiler Construction (CC'99).
33
 
34
35
 
36
 
37
38
 
39
40
 
41
Wolfe, M. 1996. Advanced loop interchanging. In Proceedings of International Conference on Parallel Processing (ICPP'96).


REVIEW

"Mitesh R Meswani : Reviewer"

Today, well-tuned systems are able to achieve 30 percent peak central processing unit (CPU) utilization. The memory subsystem continues to be the bottleneck for achieving peak performance. The increase in the gap between speedup curves of CPU and   more...

Collaborative Colleagues:
Xavier Vera: colleagues
Jaume Abella: colleagues
Josep Llosa: colleagues
Antonio González: colleagues