ABSTRACT
Energy, power, and area efficiency are critical design concerns for embedded processors. Much of the energy of a typical embedded processor is consumed in the front-end, since instruction fetching happens on nearly every cycle and involves accesses to large memory arrays such as instruction caches and branch target caches. Using small front-end arrays yields significant power and area savings, but typically causes significant performance degradation. This paper evaluates and compares optimizations that improve the performance of embedded processors with small front-end caches. We examine both software techniques, such as instruction re-ordering and selective caching, and hardware techniques, such as instruction prefetching, tagless instruction caches, and unified caches for instructions and branch targets. We demonstrate that, when built on top of a block-aware instruction set, these optimizations can eliminate the performance degradation due to small front-end caches. Moreover, selective combinations of these optimizations lead to an embedded processor that performs significantly better than the large-cache design while maintaining the area and energy efficiency of the small-cache design.