ABSTRACT
Embedded systems require maximum performance from a processor within significant constraints on power consumption and chip cost. Using software pipelining, high-performance digital signal processors can often exploit considerable instruction-level parallelism (ILP) and thus significantly improve performance. However, software pipelining can, in some instances, hinder the goals of low power consumption and low chip cost. Specifically, the registers required by a software-pipelined loop may exceed the size of the physical register set.

The register pressure incurred by software pipelining makes it difficult to build a high-performance embedded processor with a single, multi-ported register bank that has enough registers to support high levels of ILP while maintaining clock speed and limiting power consumption. The large number of ports required to support a single register bank severely hampers access time. The port requirement can be reduced in hardware by partitioning the register bank into multiple banks, each connected to a disjoint subset of the functional units, called a cluster. Because a functional unit is not directly connected to all register banks, accessing "non-local" registers incurs delays that can waste energy and resources.

The overhead of partitioning the register set can be ameliorated by high-level compiler loop optimizations such as unrolling, unroll-and-jam, and fusion. These optimizations spread data-independent parallelism across clusters, which may not require "non-local" register accesses, and can provide work to hide the latency of any such accesses that are needed.

In this paper, we examine the effects of loop fusion on DSP loops run on four simulated, clustered VLIW architectures and the Texas Instruments TMS320C64x. Our experiments show a harmonic mean speedup of 1.3 to 2.
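
As a minimal illustration of the loop fusion named above, the following C sketch merges two loops over the same index range into one. The kernels, function names, and constants are hypothetical and are not taken from the paper's benchmarks. The sketch assumes the four arrays do not overlap, so the two statements in the fused body are data-independent and a clustered scheduler would be free to place each chain of operations on a different cluster.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernels for illustration only; arrays are assumed not to alias. */

/* Before fusion: two separate loops over the same index range. */
void kernels_unfused(const int *a, const int *b, int *c, int *d, size_t n)
{
    assert(a && b && c && d);
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] * 3;      /* chain 1: reads a, writes c */
    for (size_t i = 0; i < n; i++)
        d[i] = b[i] + 7;      /* chain 2: reads b, writes d */
}

/* After fusion: one loop whose body holds two independent operation chains. */
void kernels_fused(const int *a, const int *b, int *c, int *d, size_t n)
{
    assert(a && b && c && d);
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] * 3;      /* shares no values with the next statement */
        d[i] = b[i] + 7;
    }
}

/* Returns 1 when both versions produce identical outputs on a small input. */
int fusion_preserves_results(void)
{
    enum { N = 8 };
    int a[N], b[N], c1[N], d1[N], c2[N], d2[N];
    for (int i = 0; i < N; i++) {
        a[i] = i;
        b[i] = 10 - i;
    }
    kernels_unfused(a, b, c1, d1, N);
    kernels_fused(a, b, c2, d2, N);
    for (int i = 0; i < N; i++)
        if (c1[i] != c2[i] || d1[i] != d2[i])
            return 0;
    return 1;
}
```

The fused body gives each iteration more independent work, which is the property the abstract relies on. Whether a given compiler actually assigns the two chains to different clusters depends on its partitioning and scheduling heuristics.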