|
ABSTRACT
Iterative stencil loops are used in scientific programs to implement relaxation methods for numerical simulation and signal processing. Such loops iteratively modify the same array elements over different time steps, which presents opportunities for the compiler to improve the temporal data locality through loop tiling. This article presents a compiler framework for automatic tiling of iterative stencil loops, with the objective of improving the cache performance. The article first presents a technique which allows loop tiling to satisfy data dependences in spite of the difficulty created by imperfectly nested inner loops. It does so by skewing the inner loops over the time steps and by applying a uniform skew factor to all loops at the same nesting level. Based on a memory cost analysis, the article shows that the skew factor must be minimized at every loop level in order to minimize cache misses. A graph-theoretical algorithm, which takes polynomial time, is presented to determine the minimum skew factor. Furthermore, the memory-cost analysis derives the tile size which minimizes capacity misses. Given the tile size, an efficient and general <i>array-padding</i> scheme is applied to remove conflict misses. Experiments were conducted on 16 test programs and preliminary results showed an average speedup of 1.58 and a maximum speedup of 5.06 across those test programs.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
Admas, J. C. 1999. MUDPACK: Multigrid Software for Elliptic Partial Differential Equations. Available on line at http://www.scd.ucar.edu/css/software/mudpack/.
|
 |
2
|
Nawaaz Ahmed , Nikolay Mateev , Keshav Pingali, Synthesizing transformations for locality enhancement of imperfectly-nested loop nests, Proceedings of the 14th international conference on Supercomputing, p.141-152, May 08-11, 2000, Santa Fe, New Mexico, United States
[doi> 10.1145/335231.335245]
|
| |
3
|
Ravindra K. Ahuja , Thomas L. Magnanti , James B. Orlin, Network flows: theory, algorithms, and applications, Prentice-Hall, Inc., Upper Saddle River, NJ, 1993
|
 |
4
|
|
 |
5
|
|
| |
6
|
Andersen, B. S., Gustavson, F. G., Wasniewski, J., and Yalamov, P. Y. 1999. Recursive formulation of some dense linear algebra algorithms. In Proceedings of the SIAM Conference on Parallel Processing for Scientific Computing (San Antonio, TX).
|
 |
7
|
Jennifer M. Anderson , Saman P. Amarasinghe , Monica S. Lam, Data and computation transformations for multiprocessors, Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming, p.166-178, July 19-21, 1995, Santa Barbara, California, United States
|
| |
8
|
David F. Bacon , Jyh-Herng Chow , Dz-ching R. Ju , Kalyan Muthukumar , Vivek Sarkar, A compiler framework for restructuring data declarations to enhance cache and TLB effectiveness, Proceedings of the 1994 conference of the Centre for Advanced Studies on Collaborative research, p.3, October 31-November 03, 1994, Toronto, Ontario, Canada
|
| |
9
|
|
| |
10
|
Boulet, P., Dongarra, J., Robert, Y., and Vivien, F. 1999. Static tiling for heterogeneous computing platforms. Parall. Comput. 25, 547--568.
|
 |
11
|
P. Briggs , K. D. Cooper , K. Kennedy , L. Torczon, Coloring heuristics for register allocation, Proceedings of the ACM SIGPLAN 1989 Conference on Programming language design and implementation, p.275-284, June 19-23, 1989, Portland, Oregon, United States
|
 |
12
|
Doug Burger , James R. Goodman , Alain Kägi, Memory bandwidth limitations of future microprocessors, Proceedings of the 23rd annual international symposium on Computer architecture, p.78-89, May 22-24, 1996, Philadelphia, Pennsylvania, United States
|
 |
13
|
|
 |
14
|
Siddhartha Chatterjee , Vibhor V. Jain , Alvin R. Lebeck , Shyam Mundhra , Mithuna Thottethodi, Nonlinear array layouts for hierarchical memory systems, Proceedings of the 13th international conference on Supercomputing, p.444-453, June 20-25, 1999, Rhodes, Greece
[doi> 10.1145/305138.305231]
|
 |
15
|
Siddhartha Chatterjee , Alvin R. Lebeck , Praveen K. Patnala , Mithuna Thottethodi, Recursive array layouts and fast parallel matrix multiplication, Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures, p.222-231, June 27-30, 1999, Saint Malo, France
[doi> 10.1145/305619.305645]
|
 |
16
|
D. Cociorva , J. W. Wilkins , C. Lam , G. Baumgartner , J. Ramanujam , P. Sadayappan, Loop optimization for a class of memory-constrained computations, Proceedings of the 15th international conference on Supercomputing, p.103-113, June 2001, Sorrento, Italy
[doi> 10.1145/377792.377814]
|
 |
17
|
|
| |
18
|
Collard, J.-F. 1994. Space-time transformation of while-loops using speculative execution. In Proceedings of the Scalable High Performance Computing Conference (Knoxville, TN). 429--436.
|
| |
19
|
|
| |
20
|
|
| |
21
|
|
| |
22
|
|
 |
23
|
Somnath Ghosh , Margaret Martonosi , Sharad Malik, Precise miss analysis for program transformations with caches of arbitrary associativity, Proceedings of the eighth international conference on Architectural support for programming languages and operating systems, p.228-239, October 02-07, 1998, San Jose, California, United States
|
 |
24
|
Junjie Gu , Zhiyuan Li , Gyungho Lee, Experience with efficient array data flow analysis for array privatization, Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming, p.157-167, June 18-21, 1997, Las Vegas, Nevada, United States
|
| |
25
|
Haghighat, M. R. 1990. Symbolic dependence analysis for high performance parallelizing compilers. Ph.D. dissertation. Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL.
|
| |
26
|
|
 |
27
|
|
| |
28
|
|
 |
29
|
|
| |
30
|
|
 |
31
|
Induprakas Kodukula , Nawaaz Ahmed , Keshav Pingali, Data-centric multi-level blocking, Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation, p.346-357, June 16-18, 1997, Las Vegas, Nevada, United States
|
| |
32
|
|
 |
33
|
Monica D. Lam , Edward E. Rothberg , Michael E. Wolf, The cache performance and optimizations of blocked algorithms, Proceedings of the fourth international conference on Architectural support for programming languages and operating systems, p.63-74, April 08-11, 1991, Santa Clara, California, United States
|
| |
34
|
|
| |
35
|
Matula, D. and Beck, L. 1981. Smallest-last ordering and clustering and graph coloring algorithms. Tech. rep. TR CSE 8104. Department of Computer Science and Engineering, Southern Methodist University, Dallas, TX.
|
| |
36
|
|
| |
37
|
|
| |
38
|
Object-Oriented Scientific Computing. 2001. Blitz++. Object-Oriented Scientific Computing, Available online at http://www.oonumerics.org/blitz/benchmarks/.
|
 |
39
|
|
| |
40
|
|
| |
41
|
|
 |
42
|
|
| |
43
|
|
| |
44
|
|
| |
45
|
|
| |
46
|
|
| |
47
|
|
| |
48
|
|
 |
49
|
|
 |
50
|
|
 |
51
|
Michelle Mills Strout , Larry Carter , Jeanne Ferrante , Beth Simon, Schedule-independent storage mapping for loops, Proceedings of the eighth international conference on Architectural support for programming languages and operating systems, p.24-33, October 02-07, 1998, San Jose, California, United States
|
 |
52
|
O. Temam , C. Fricker , W. Jalby, Cache interference phenomena, Proceedings of the 1994 ACM SIGMETRICS conference on Measurement and modeling of computer systems, p.261-271, May 16-20, 1994, Nashville, Tennessee, United States
|
 |
53
|
David S. Wise , Jeremy D. Frens , Yuhong Gu , Gregory A. Alexander, Language support for Morton-order matrices, Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming, p.24-33, June 2001, Snowbird, Utah, United States
|
| |
54
|
|
| |
55
|
|
| |
56
|
|
| |
57
|
|
CITED BY 7
|
|
Shoaib Kamil , Parry Husbands , Leonid Oliker , John Shalf , Katherine Yelick, Impact of modern memory subsystems on cache optimizations for stencil computations, Proceedings of the 2005 workshop on Memory system performance, June 12-12, 2005, Chicago, Illinois
|
|
|
Samuel Williams , John Shalf , Leonid Oliker , Shoaib Kamil , Parry Husbands , Katherine Yelick, The potential of the cell processor for scientific computing, Proceedings of the 3rd conference on Computing frontiers, May 03-05, 2006, Ischia, Italy
|
|
|
Samuel Williams , John Shalf , Leonid Oliker , Shoaib Kamil , Parry Husbands , Katherine Yelick, Scientific computing Kernels on the cell processor, International Journal of Parallel Programming, v.35 n.3, p.263-298, June 2007
|
|
|
|
|
|
|
|
|
|
|
|
|
|