ABSTRACT
Because of stringent power constraints, aggressive latency-hiding approaches such as prefetching are absent from state-of-the-art embedded processors. Two main reasons make prefetching power inefficient. First, compiler-inserted prefetch instructions increase code size and can therefore increase I-cache power. Second, inaccurate prefetching (especially hardware prefetching) leads to high D-cache power consumption because of useless accesses. In this work, we show that it is possible to support power-efficient prefetching through bit-differential offset assignment. We target the prefetching of relocatable stack variables with a high degree of precision. By assigning the offsets of stack variables so that most consecutive addresses differ by 1 bit, we can prefetch them with compact prefetch instructions and save I-cache power. The compiler first generates an access graph of consecutive memory references and then attempts a layout of the memory locations in the smallest hypercube. Each dimension of the hypercube represents a 1-bit address difference. The embedding is carried out in as compact a hypercube as possible in order to save memory space. Each load/store instruction carries a hint for prefetching the next memory reference by encoding its differential address with respect to the current one. To reduce D-cache power cost, we further attempt to assign offsets so that most consecutive accesses map to the same cache line. Our prefetching is done using a one-entry line buffer [Wilson et al. 1996]. Consequently, many look-ups in the D-cache reduce to incremental ones, which reduces D-cache activity and saves power. Our prefetcher requires both compiler and hardware support. In this paper, we provide an implementation on a processor model close to ARM, with a small modification to the ISA. We tackle issues such as out-of-order commit, predication, and speculation through simple modifications to the processor pipeline on noncritical paths. Our goal in this work is to boost performance while maintaining or lowering power consumption. Our results show a 12% speedup and a slight power reduction. The runtime virtual space loss for stack and static data is about 11.8%.
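The offset-assignment idea can be illustrated with a small sketch. This is not the paper's algorithm: the greedy placement, the function names (`access_graph`, `assign_offsets`, `prefetch_hint`), and the treatment of each hypercube vertex as a word-sized stack slot are all assumptions made for illustration. The sketch builds an access graph from a trace of variable references, places the variables on the vertices of the smallest hypercube that can hold them, and encodes a prefetch hint as the index of the single bit that differs between two slots.

```python
from collections import Counter

def access_graph(trace):
    """Weight each unordered pair of variables by how often they are accessed back to back."""
    graph = Counter()
    for a, b in zip(trace, trace[1:]):
        if a != b:
            graph[frozenset((a, b))] += 1
    return graph

def hamming(x, y):
    """Number of bit positions in which two slot indices differ."""
    return bin(x ^ y).count("1")

def assign_offsets(trace):
    """Greedy hypercube placement (hypothetical heuristic, not the paper's method)."""
    graph = access_graph(trace)

    def degree(v):
        return sum(w for edge, w in graph.items() if v in edge)

    # Place the most heavily connected variables first; break ties by name.
    variables = sorted(set(trace), key=lambda v: (-degree(v), v))
    dims = max(1, (len(variables) - 1).bit_length())  # smallest cube with enough vertices
    free = set(range(1 << dims))
    slot = {}
    for v in variables:
        def cost(s):
            # Total weight of edges to already placed neighbours that would
            # NOT end up exactly one bit away from candidate slot s.
            return sum(w for edge, w in graph.items() if v in edge
                       for u in edge
                       if u != v and u in slot and hamming(s, slot[u]) != 1)
        best = min(sorted(free), key=cost)
        slot[v] = best
        free.remove(best)
    return slot, dims

def prefetch_hint(cur, nxt):
    """Index of the single differing bit, or None if the pair is not 1-bit encodable."""
    diff = cur ^ nxt
    if diff and diff & (diff - 1) == 0:
        return diff.bit_length() - 1
    return None

slots, dims = assign_offsets(["a", "b", "c", "d", "a", "b", "c", "d"])
print(dims, slots)
```

With `d` dimensions, a hint needs only about `log2(d)` bits plus a valid flag, which is why a 1-bit differential layout keeps the prefetch encoding compact. A real compiler pass would also have to weigh the cache-line co-location goal and the unused vertices, which account for the memory space loss mentioned above.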
REFERENCES
1. ARM Co. Ltd. ARM 7TDMI Data Sheet.
2. ARM Co. Ltd. ARM7500FE Data Sheet.
3. Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman, Compilers: principles, techniques, and tools, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, 1986.
6. Burger, D. and Austin, T. M. 1997. The SimpleScalar tool set version 2.0. Tech. Report 1342, Univ. of Wisconsin--Madison (May).
7. Brad Calder, Chandra Krintz, Simmi John, Todd Austin, Cache-conscious data placement, Proceedings of the eighth international conference on Architectural support for programming languages and operating systems, p.139-149, October 02-07, 1998, San Jose, California, United States.
9. Sangyeun Cho, Pen-Chung Yew, Gyungho Lee, Decoupling local variable accesses in a wide-issue superscalar processor, Proceedings of the 26th annual international symposium on Computer architecture, p.100-110, May 01-04, 1999, Atlanta, Georgia, United States.
10. M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, R. B. Brown, MiBench: A free, commercially representative embedded benchmark suite, Proceedings of the 2001 IEEE International Workshop on Workload Characterization (WWC-4), p.3-14, December 02, 2001. doi: 10.1109/WWC.2001.15
11. Gadi Haber, Moshe Klausner, Vadim Eisenberg, Bilha Mendelson, Maxim Gurevich, Optimization opportunities created by global data reordering, Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization, March 23-26, 2003, San Francisco, California.
12. Intel Corp. SA-110 Microprocessor Tech. Ref. Manual.
16. Noth, W. and Kolla, R. 1990. Spanning tree-based state encoding for low-power dissipation. DATE, 168--174.
17. Mikko H. Lipasti, William J. Schmidt, Steven R. Kunkel, Robert R. Roediger, SPAID: software prefetching in pointer- and call-intensive environments, Proceedings of the 28th annual international symposium on Microarchitecture, p.231-236, November 29-December 01, 1995, Ann Arbor, Michigan, United States.
18. Toshihiro Ozawa, Yasunori Kimura, Shin'ichiro Nishizaki, Cache miss heuristics and preloading techniques for general-purpose programs, Proceedings of the 28th annual international symposium on Microarchitecture, p.243-248, November 29-December 01, 1995, Ann Arbor, Michigan, United States.
21. Pomerene, J., Puzak, T., Rechtschaffen, R., and Sparacio, F. 1989. Prefetching system for a cache having a second directory for sequentially accessed blocks. U. S. Patent number 4,807,110 (Feb.).
23. Segars, S. 2001. Low power design techniques for microprocessors. ISSCC (Feb.).
25. Udayanarayanan, S. and Chakrabarti, C. 2001. Address code generation for DSPs. DAC.
28. Kenneth M. Wilson, Kunle Olukotun, Mendel Rosenblum, Increasing cache port efficiency for dynamic superscalar microprocessors, Proceedings of the 23rd annual international symposium on Computer architecture, p.147-157, May 22-24, 1996, Philadelphia, Pennsylvania, United States.
29. Wilton, S. and Jouppi, N. P. 1993. An enhanced access and cycle time model for on-chip caches. Technical Report TN93/5, Compaq Western Research Lab.
30. Xiaotong Zhuang, ChokSheak Lau, Santosh Pande, Storage assignment optimizations through variable coalescence for embedded processors, Proceedings of the 2003 ACM SIGPLAN conference on Language, compiler, and tool for embedded systems, June 11-13, 2003, San Diego, California, USA.
31. Xiaotong Zhuang, Santosh Pande, Power-efficient prefetching via bit-differential offset assignment on embedded processors, Proceedings of the 2004 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems, June 11-13, 2004, Washington, DC, USA.