ACM Home Page
Please provide us with feedback. Feedback
Optimal loop parallelization for maximizing iteration-level parallelism
Full text PdfPdf (773 KB)
Source
International Conference on Compilers, Architecture and Synthesis for Embedded Systems archive
Proceedings of the 2009 international conference on Compilers, architecture, and synthesis for embedded systems table of contents
Grenoble, France
SESSION: Parallelism, streams, and spilling table of contents
Pages 67-76  
Year of Publication: 2009
ISBN:978-1-60558-626-7
Authors
Duo Liu  The Hong Kong Polytechnic University, Hung Hom, Hong Kong
Zili Shao  The Hong Kong Polytechnic University, Hung Hom, Hong Kong
Meng Wang  The Hong Kong Polytechnic University, Hung Hom, Hong Kong
Minyi Guo  Shanghai Jiao Tong University, Shanghai, China
Jingling Xue  The University of New South Wales, Sydney, Australia
Sponsors
SIGDA: ACM Special Interest Group on Design Automation
ACM: Association for Computing Machinery
SIGBED: ACM Special Interest Group on Embedded Systems
SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 14,   Downloads (12 Months): 14,   Citation Count: 0
Additional Information:

abstract   references   index terms  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1629395.1629407
What is a DOI?

ABSTRACT

This paper solves the open problem of extracting the maximal number of iterations from a loop that can be executed in parallel on chip multiprocessors. Our algorithm solves it optimally by migrating the weights of parallelism-inhibiting dependences on dependence cycles in two phases. First, we model dependence migration with retiming and formulate this classic loop parallelization into a graph optimization problem, i.e., one of finding retiming values for its nodes so that the minimum non-zero edge weight in the graph is maximized. We present our algorithm in three stages with each being built incrementally on the preceding one. Second, the optimal code for a loop is generated from the retimed graph of the loop found in the first phase. We demonstrate the effectiveness of our optimal algorithm by comparing with a number of representative non-optimal algorithms using a set of benchmarks frequently used in prior work.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
A. Aiken and A. Nicolau. Optimal loop parallelization. ACM SIGPLAN Notices, 23(7):308--317, 1988.
 
2
R. Allen and K. Kennedy. Optimizing Compilers for Modern Architectures: A Dependence-based Approach. Morgan Kaufmann, first edition, 2001.
 
3
L. Bic, J. M. A. Roy, and M. Nagel. Exploiting iteration-level parallelism in dataflow programs. In Proceedings of the 12th International Conference on Distributed Computing Systems, pages 376--381, June 1992.
 
4
W.-L. Chang, C.-P. Chu, and M. Ho. Exploitation of parallelism to nested loops with dependence cycles. Journal of Systems Architecture, 50(12):729--742, 2004.
 
5
L.-F. Chao. Scheduling and Behavioral Transformations for Parallel Systems. PhD thesis, Dept. of Computer Science, Princeton University, 1993.
 
6
L.-F. Chao, A. S. LaPaugh, and E. H.-M. Sha. Rotation scheduling: a loop pipelining algorithm. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 16(3):229--239, March 1997.
 
7
L.-F. Chao and E. H.-M. Sha. Static scheduling of uniform nested loops. In Proceedings of the 7th International Parallel Processing Symposium, pages 254--258, April 1993.
 
8
D.-K. Chen and P.-C. Yew. Statement re-ordering for doacross loops. In Proceedings of the 1994 International Conference on Parallel Processing (ICPP '94), volume 2, pages 24--28, August 1994.
 
9
C.-P. Chu and D. L. Carver. Reordering the statements with dependence cycles to improve the performance of parallel loops. In Proceedings of the 1997 International Conference on Parallel and Distributed Systems (ICPADS '97), pages 322--328, 1997.
 
10
R. G. Cytron. Compile-time Scheduling and Optimizations for Multiprocessor System. PhD thesis, University of Illinois at Urbana--Champaign, Champaign, IL, USA, September 1984.
 
11
A. Darte and G. Huard. Complexity of multi-dimensional loop alignment. In Proceedings of the 19th Annual Symposium on Theoretical Aspects of Computer Science (STACS '02), pages 179--191, 2002.
 
12
A. Fraboulet and A. Mignotte. Source code loop transformations for memory hierarchy optimizations. In Proceedings of the Workshop on Memory Access Decoupled Architecture (MEDEA '01), pages 8--12, Barcelona, Spain, September 2001.
 
13
D. A. P. Haiek. Multiprocessors: Discussion of some Theoretical and Practical Problems. PhD thesis, Deptartment of Computer Science, University of Illinois at Urbana-Champaign, November 1979.
 
14
D. N. Jayasimha and J. D. Martens. Some architectural and compilation issues in the design of hierarchical shared-memory multiprocessors. In Proceedings of the 6th International Parallel Processing Symposium, pages 567--572, March 1992.
 
15
C. E. Leiserson1 and J. B. Saxe. Retiming synchronous circuitry. Algorithmica, 6:5--35, 1991.
 
16
L.-S. Liu, C.-W. Ho, and J.-P. Sheu. On the parallelism of nested for-loops using index shift method. In Proceedings of the International Conference on Parallel Processing (ICPP '90), pages 119--123, 1990.
 
17
M. Liu, Q. Zhuge, Z. Shao, and E. H.-M. Sha. General loop fusion technique for nested loops considering timing and code size. In Proceedings of the 2004 international conference on Compilers, architecture, and synthesis for embedded systems (CASES '04), pages 190--201, September 2004.
 
18
N. Manjikian and T. S. Abdelrahman. Fusion of loops for parallelism and locality. IEEE Transactions on Parallel and Distributed Systems, 8(2):193--209, February 1997.
 
19
K. Okuda. Cycle shrinking by dependence reduction. In Proceedings of the Second International Euro-Par Conference on Parallel Processing (Euro-Par '96), pages 398--401, 1996.
 
20
D. Padua, D. Kuck, and D. Lawrie. High-speed multiprocessors and compilation techniques. IEEE Transactions on Computers, C--29(9):763--776, September 1980.
 
21
L. Passos and E. H.-M. Sha. Achieving full parallelism using multi-dimensional retiming. IEEE Transactions on Parallel and Distributed Systems, 7(11):1150--1163, November 1996.
 
22
N. L. Passos and E. H.-M. Sha. Full parallelism in uniform nested loops using multi-dimensional retiming. In Proceedings of the 1994 International Conference on Parallel Processing (ICPP '94), pages 130--133, August 1994.
 
23
C. D. Polychronopoulos. Advanced loop optimizations for parallel computers. In Proceedings of the 1st International Conference on Supercomputing, pages 255--277, March 1988.
 
24
C. D. Polychronopoulos. Compiler optimizations for enhancing parallelism and their impact on architecture design. IEEE Transactions on Computers, 37(8):991--1004, August 1988.
 
25
Y. Robert and S. W. Song. Revisiting cycle shrinking. Parallel Computing, 18(5):481--496, 1992.
 
26
E. H.-M. Sha, C. Lang, and N. L. Passos. Polynomial-time nested loop fusion with full parallelism. In Proceedings of the Proceedings of the 1996 International Conference on Parallel Processing (ICPP '96), volume 3, page 9, August 1996.
 
27
W. Shang, M. T. O'Keefe, and J. A. B. Fortes. On loop transformations for generalized cycle shrinking. IEEE Transactions on Parallel and Distributed Systems, 5(2):193--204, February 1994.
 
28
L. Solve. http://lpsolve.sourceforge.net/5.5/. 2009.
 
29
X. Tang and G. R. Gao. Automatically partitioning threads for multithreaded architectures. Journal of Parallel and Distributed Computing, 58(2):159--189, 1999.
 
30
C.-M. Wang and S.-D. Wang. Compiler techniques to extract parallelism within a nested loop. In Proceedings of the Fifteenth Annual International Computer Software and Applications Conference (COMPSAC '91), pages 24--29, Tokyo, Japan, September 1991.
 
31
M. E. Wolf and M. S. Lam. A loop transformation theory and an algorithm to maximize parallelism. IEEE Transactions on Parallel and Distributed Systems, 2(4):452--471, 1991.
 
32
C. Xue, Z. Shao, M. Liu, and E. H.-M. Sha. Iterational retiming: maximize iteration-level parallelism for nested loops. In Proceedings of the 3rd IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis (CODES+ISSS '05), pages 309--314, September 2005.
 
33
J. Xue, M. Guo, and D. Wei. Improving the parallelism of iterative methods by aggressive loop fusion. Journal of Supercomputing, 43(2):147--164, February 2008.
 
34
J. Xue, Q. Huang, and M. Guo. Enabling loop fusion and tiling for cache performance by fixing fusion-preventing data dependences. In Proceedings of the 2005 International Conference on Parallel Processing (ICPP '05), pages 107--115, 2005.