ABSTRACT
Previous research has used program transformation to introduce parallelism and to exploit data locality. Unfortunately, these two objectives have usually been considered independently. This work explores the trade-offs between effectively utilizing parallelism and the memory hierarchy on shared-memory multiprocessors. We present a simple but surprisingly accurate memory model that determines cache line reuse both from multiple accesses to the same memory location and from consecutive memory accesses. The model is used in memory-optimizing and loop-parallelization algorithms that exploit data locality and parallelism in concert. We demonstrate the efficacy of this approach with encouraging experimental results.
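The two sources of cache line reuse named in the abstract can be illustrated with a small sketch. This is not the paper's memory model; it only simulates one innermost loop and counts the distinct cache lines that a single array reference touches. The array extent `N`, the line size `LINE`, the column-major layout, and the helper `cache_lines_touched` are all assumptions introduced for illustration.

```python
def cache_lines_touched(address_of, trip_count, line_size):
    """Count distinct cache lines one reference touches over an innermost loop.

    address_of maps the loop index k to an element offset (hypothetical helper).
    """
    return len({address_of(k) // line_size for k in range(trip_count)})


N = 64     # assumed array extent, in elements
LINE = 8   # assumed cache line size, in elements

# Assumed column-major layout: element A(i, j) sits at offset i + j * N.

# Same location on every iteration: A(3, 5).
invariant = cache_lines_touched(lambda k: 3 + 5 * N, N, LINE)

# Consecutive locations: A(k, 5), unit stride down a column.
consecutive = cache_lines_touched(lambda k: k + 5 * N, N, LINE)

# Non-consecutive locations: A(3, k), stride N across a row.
strided = cache_lines_touched(lambda k: 3 + k * N, N, LINE)

print(invariant, consecutive, strided)
```

Under these assumptions, the loop-invariant reference stays on one line, the unit-stride reference touches `N / LINE` lines, and the strided reference touches a new line on every iteration. Placing the loop that yields the fewest lines innermost is one way such a count could guide loop ordering.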