ABSTRACT
Software pipelining is a critical optimization for producing efficient code for VLIW/EPIC and superscalar processors in high-performance embedded applications such as digital signal processing. Software thread integration (STI) can often improve the performance of looping code in cases where software pipelining performs poorly or fails. This paper examines both situations, presenting methods to determine what and when to integrate. We evaluate our methods on C-language image and digital signal processing libraries and synthetic loop kernels, compiled for a very long instruction word (VLIW) digital signal processor (DSP), the Texas Instruments (TI) C64x architecture. Loops which benefit little from software pipelining (SWP-Poor) speed up by 26% (harmonic mean, HM). Loops for which software pipelining fails (SWP-Fail) due to conditionals and calls speed up by 16% (HM). Combining SWP-Good and SWP-Poor loops leads to a speedup of 55% (HM).
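As a rough illustration of what integrating two loops can look like at the source level, the C sketch below merges two independent loop kernels into one loop body, so that a VLIW scheduler sees more independent operations per iteration. It is a minimal sketch under stated assumptions, not the paper's tool or algorithm: the kernels `scale` and `clip`, the integrated function `scale_clip_integrated`, and all parameter names are hypothetical, and the buffers of the two kernels are assumed not to overlap.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel A: multiply every sample by a constant. */
void scale(const short *in, short *out, size_t n, short k) {
    for (size_t i = 0; i < n; i++)
        out[i] = (short)(in[i] * k);
}

/* Hypothetical kernel B: clamp samples to an upper limit.
 * The conditional stands in for the kind of loop body that a
 * software pipeliner may handle poorly. */
void clip(const short *in, short *out, size_t n, short limit) {
    for (size_t i = 0; i < n; i++) {
        short v = in[i];
        if (v > limit)
            v = limit;
        out[i] = v;
    }
}

/* Integrated version (illustrative only): both loop bodies run in a
 * single loop over the shared iteration range, followed by clean-up
 * loops for whichever kernel has iterations left over.
 * Assumption: the buffers of kernel A and kernel B do not overlap. */
void scale_clip_integrated(const short *a_in, short *a_out, size_t a_n, short k,
                           const short *b_in, short *b_out, size_t b_n, short limit) {
    /* Cheap partial check of the no-overlap assumption. */
    assert((const short *)a_out != b_in && (const short *)b_out != a_in);

    size_t common = a_n < b_n ? a_n : b_n;
    size_t i;

    /* Jammed loop: one iteration of A and one of B per trip. */
    for (i = 0; i < common; i++) {
        a_out[i] = (short)(a_in[i] * k);

        short v = b_in[i];
        if (v > limit)
            v = limit;
        b_out[i] = v;
    }

    /* Remaining iterations of kernel A, if any. */
    for (size_t j = i; j < a_n; j++)
        a_out[j] = (short)(a_in[j] * k);

    /* Remaining iterations of kernel B, if any. */
    for (size_t j = i; j < b_n; j++) {
        short v = b_in[j];
        if (v > limit)
            v = limit;
        b_out[j] = v;
    }
}
```

Calling `scale_clip_integrated` should produce the same output buffers as calling `scale` and `clip` separately on the same inputs. Whether the merged loop actually runs faster depends on the target and compiler.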