|
ABSTRACT
In the past decade, processor speed has become significantly faster than memory speed. Small, fast cache memories are designed to overcome this discrepancy, but they are only effective when programs exhibit data locality. In the this article, we present compiler optimizations to improve data locality based on a simple yet accurate cost model. The model computes both temporal and spatial reuse of cache lines to find desirable loop organizations. The cost model drives the application of compound transformations consisting of loop permutation, loop fusion, loop distribution, and loop reversal. To validate our optimization strategy, we implemented our algorithms and ran experiments on a large collection of scientific programs and kernels. Experiments illustrate that for kernels our model and algorithm can select and achieve the best loop structure for a nest. For over 30 complete applications, we executed the original and transformed versions and simulated cache hit rates. We collected statistics about the inherent characteristics of these programs and our ability to improve their data locality. To our knowledge, these studies are the first of such breadth and depth. We found performance improvements were difficult to achieve bacause benchmark programs typically have high hit rates even for small data caches; however, our optimizations significanty improved several programs.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
 |
2
|
|
 |
3
|
|
| |
4
|
|
 |
5
|
|
| |
6
|
|
| |
7
|
|
 |
8
|
|
| |
9
|
|
| |
10
|
CARR, S. AND WU, Q. 1995. An analysis of loop permutation on the HP PA-RISC. Tech. Rep. TR95-03, Michigan Technological Univ., Houghton, Mich. Feb.
|
 |
11
|
|
| |
12
|
COOPER, K., HALL, M. W., HOOD, R. T., KENNEDY, K., MCKINLEY, K. S., MELLOR-CRUMMEY, J. M., TORCZON, L., AND WARREN, S. K. 1993. The ParaScope parallel programming environmerit. Proc. IEEE 81, 2 (Feb.), 244-263.
|
| |
13
|
COOPER, K., HALL, M. W., AND KENNEDY, K. 1993. A methodology for procedure cloning. Comput. Lang. 19, 2 (Feb.), 105-117.
|
| |
14
|
|
| |
15
|
|
 |
16
|
Gina Goff , Ken Kennedy , Chau-Wen Tseng, Practical dependence testing, Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation, p.15-29, June 24-28, 1991, Toronto, Ontario, Canada
|
 |
17
|
Mary W. Hall , Ken Kennedy , Kathryn S. McKinley, Interprocedural transformations for parallel code generation, Proceedings of the 1991 ACM/IEEE conference on Supercomputing, p.424-434, November 18-22, 1991, Albuquerque, New Mexico, United States
[doi> 10.1145/125826.126055]
|
 |
18
|
|
 |
19
|
|
| |
20
|
|
| |
21
|
KENNEDY, K., MCKINLEY, K. S., AND TSENG, C.-W. 1993. Analysis and transformation in an interactive parallel programming tool. Concurrency Pract. Exper. 5, 7 (Oct.), 575-602.
|
 |
22
|
D. J. Kuck , R. H. Kuhn , D. A. Padua , B. Leasure , M. Wolfe, Dependence graphs and compiler optimizations, Proceedings of the 8th ACM SIGPLAN-SIGACT symposium on Principles of programming languages, p.207-218, January 26-28, 1981, Williamsburg, Virginia
[doi> 10.1145/567532.567555]
|
 |
23
|
Monica D. Lam , Edward E. Rothberg , Michael E. Wolf, The cache performance and optimizations of blocked algorithms, Proceedings of the fourth international conference on Architectural support for programming languages and operating systems, p.63-74, April 08-11, 1991, Santa Clara, California, United States
|
 |
24
|
|
| |
25
|
|
 |
26
|
|
| |
27
|
|
 |
28
|
|
| |
29
|
WOLFE, M. J. 1986. Advanced loop interchanging. In Proceedings of the 1986 International Conference on Parallel Processing. CRC Press, Boca Raton, Fla.
|
| |
30
|
|
| |
31
|
WOLFE, M. J. 1991. The Tiny loop restructuring research tool. In Proceedings of the 1991 International Conference on Parallel Processing. CRC Press, Boca Raton, Fla.
|
CITED BY 130
|
|
|
|
|
|
|
|
|
|
|
M. Kandemir , A. Choudhary , J. Ramanujam , M. Kandaswamy, A unified compiler algorithm for optimizing locality, parallelism and communication in out-of-core computations, Proceedings of the fifth workshop on I/O in parallel and distributed systems, p.79-92, November 17-17, 1997, San Jose, California, United States
|
|
|
|
|
|
|
|
|
Gerald Roth , John Mellor-Crummey , Ken Kennedy , R. Gregg Brickner, Compiling stencils in high performance Fortran, Proceedings of the 1997 ACM/IEEE conference on Supercomputing (CDROM), p.1-20, November 15-21, 1997, San Jose, CA
|
|
|
|
|
|
D. Cociorva , J. W. Wilkins , C. Lam , G. Baumgartner , J. Ramanujam , P. Sadayappan, Loop optimization for a class of memory-constrained computations, Proceedings of the 15th international conference on Supercomputing, p.103-113, June 2001, Sorrento, Italy
|
|
|
|
|
|
|
|
|
Richard Vuduc , James W. Demmel , Katherine A. Yelick , Shoaib Kamil , Rajesh Nishtala , Benjamin Lee, Performance optimizations and bounds for sparse matrix-vector multiply, Proceedings of the 2002 ACM/IEEE conference on Supercomputing, p.1-35, November 16, 2002, Baltimore, Maryland
|
|
|
|
|
|
M. Kandemir , A. Choudhary , N. Shenoy , P. Banerjee , J. Ramanujam, A hyperplane based approach for optimizing spatial locality in loop nests, Proceedings of the 12th international conference on Supercomputing, p.69-76, July 1998, Melbourne, Australia
|
|
|
Somnath Ghosh , Margaret Martonosi , Sharad Malik, Automated cache optimizations using CME driven diagnosis, Proceedings of the 14th international conference on Supercomputing, p.316-326, May 08-11, 2000, Santa Fe, New Mexico, United States
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
M. Kandemir , P. Banerjee , A. Choudhary , J. Ramanujam , E. Ayguadé, An integer linear programming approach for optimizing cache locality, Proceedings of the 13th international conference on Supercomputing, p.500-509, June 20-25, 1999, Rhodes, Greece
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
P. R. Panda , F. Catthoor , N. D. Dutt , K. Danckaert , E. Brockmeyer , C. Kulkarni , A. Vandercappelle , P. G. Kjeldsberg, Data and memory optimization techniques for embedded systems, ACM Transactions on Design Automation of Electronic Systems (TODAES), v.6 n.2, p.149-206, April 2001
|
|
|
|
|
|
|
|
|
Kamen Yotov , Xiaoming Li , Gang Ren , Michael Cibulskis , Gerald DeJong , Maria Garzaran , David Padua , Keshav Pingali , Paul Stodghill , Peng Wu, A comparison of empirical and model-driven optimization, ACM SIGPLAN Notices, v.38 n.5, May 2003
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
P. Unnikrishnan , G. Chen , M. Kandemir , D. R. Mudgett, Dynamic compilation for energy adaptation, Proceedings of the 2002 IEEE/ACM international conference on Computer-aided design, p.158-163, November 10-14, 2002, San Jose, California
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
I. Kadayif , M. Kandemir , G. Chen , N. Vijaykrishnan , M. J. Irwin , A. Sivasubramaniam, Compiler-directed high-level energy estimation and optimization, ACM Transactions on Embedded Computing Systems (TECS), v.4 n.4, p.819-850, November 2005
|
|
|
Meilin Liu , Qingfeng Zhuge , Zili Shao , Edwin H.-M. Sha, General loop fusion technique for nested loops considering timing and code size, Proceedings of the 2004 international conference on Compilers, architecture, and synthesis for embedded systems, September 22-25, 2004, Washington DC, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Mahmut Kandemir , Alok Choudhary , J. Ramanujam , Meenakshi A. Kandaswamy, A Unified Framework for Optimizing Locality, Parallelism, and Communication in Out-of-Core Computations, IEEE Transactions on Parallel and Distributed Systems, v.11 n.7, p.648-668, July 2000
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Qing Yi , Ken Kennedy , Haihang You , Keith Seymour , Jack Dongarra, Automatic blocking of QR and LU factorizations for locality, Proceedings of the 2004 workshop on Memory system performance, June 08-08, 2004, Washington, D.C.
|
|
|
Keith D. Cooper , Alexander Grosul , Timothy J. Harvey , Steven Reeves , Devika Subramanian , Linda Torczon , Todd Waterman, ACME: adaptive compilation made efficient, ACM SIGPLAN Notices, v.40 n.7, July 2005
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
J. Dongarra , G. Bosilca , Z. Chen , V. Eijkhout , G. E. Fagg , E. Fuentes , J. Langou , P. Luszczek , J. Pjesivac-Grbovic , K. Seymour , H. You , S. S. Vadhiyar, Self-adapting numerical software (SANS) effort, IBM Journal of Research and Development, v.50 n.2/3, p.223-238, March 2006
|
|
|
Sandhya Krishnan , Sriram Krishnamoorthy , Gerald Baumgartner , Chi-Chung Lam , J. Ramanujam , P. Sadayappan , Venkatesh Choppella, Efficient synthesis of out-of-core algorithms using a nonlinear optimization solver, Journal of Parallel and Distributed Computing, v.66 n.5, p.659-673, May 2006
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Doosan Cho , Ilya Issenin , Nikil Dutt , Jonghee W. Yoon , Yunheung Paek, Software controlled memory layout reorganization for irregular array access patterns, Proceedings of the 2007 international conference on Compilers, architecture, and synthesis for embedded systems, September 30-October 03, 2007, Salzburg, Austria
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
P. Marchal , J. I. Gomez , L. Pinuel , D. Bruni , L. Benini , F. Catthoor , H. Corporaal, SDRAM-Energy-Aware Memory Allocation for Dynamic Multi-Media Applications on Multi-Processor Platforms, Proceedings of the conference on Design, Automation and Test in Europe, p.10516, March 03-07, 2003
|
|
|
Tushar Mohan , Bronis R. de Supinski , Sally A. McKee , Frank Mueller , Andy Yoo , Martin Schulz, Identifying and Exploiting Spatial Regularity in Data Memory References, Proceedings of the 2003 ACM/IEEE conference on Supercomputing, p.49, November 15-21, 2003
|
|
|
|
|
|
|
|
|
|
|
|
Per Gunnar Kjeldsberg , Francky Catthoor , Sven Verdoolaege , Martin Palkovic , Arnout Vandecappelle , Qubo Hu , Einar J. Aas, Guidance of Loop Ordering for Reduced Memory Usage in Signal Processing Applications, Journal of Signal Processing Systems, v.53 n.3, p.301-321, December 2008
|
|
|
Mohammed Fellahi , Albert Cohen , Sid Touati, Code-size conscious pipelining of imperfectly nested loops, Proceedings of the 2007 workshop on MEmory performance: DEaling with Applications, systems and architecture, p.49-55, September 16-16, 2007, Brasov, Romania
|
|
|
Javed Absar , Min Li , Praveen Raghavan , Andy Lambrechts , Murali Jayapala , Arnout Vandecappelle , Francky Catthoor, Locality optimization in wireless applications, Proceedings of the 5th IEEE/ACM international conference on Hardware/software codesign and system synthesis, September 30-October 03, 2007, Salzburg, Austria
|
|
|
|
|
|
Van Bui , Boyana Norris , Kevin Huck , Lois Curfman McInnes , Li Li , Oscar Hernandez , Barbara Chapman, A component infrastructure for performance and power modeling of parallel scientific applications, Proceedings of the 2008 compFrame/HPC-GECO workshop on Component based high performance, October 16-17, 2008, Karlsruhe, Germany
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Xiaoming Gu , Ian Christopher , Tongxin Bai , Chengliang Zhang , Chen Ding, A component model of spatial locality, Proceedings of the 2009 international symposium on Memory management, June 19-20, 2009, Dublin, Ireland
|
|
|
Doosan Cho , Sudeep Pasricha , Ilya Issenin , Nikil D. Dutt , Minwook Ahn , Yunheung Paek, Adaptive scratch pad memory management for dynamic behavior of multimedia applications, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, v.28 n.4, p.554-567, April 2009
|
|
|
|
INDEX TERMS
Primary Classification:
D.
Software
D.3
PROGRAMMING LANGUAGES
D.3.4
Processors
Subjects:
Optimization
Additional Classification:
D.
Software
D.3
PROGRAMMING LANGUAGES
D.3.4
Processors
Subjects:
Compilers
General Terms:
Algorithms,
Languages,
Performance
Keywords:
Cache,
compiler optimization,
data locality,
loop distribution,
loop fusion,
loop permutation,
loop reversal,
loop transformations,
microprocessors,
simulation
REVIEW
"Noah S. Prywes : Reviewer"
When using a processor with a cache memory of limited size, the
order of memory referencing in the program can affect the overall
computation time. It would be too difficult for a programmer to consider
this effect in composing a program. Inst
more...
|