|
ABSTRACT
Modern embedded processors are designed to maximize execution efficiency--the amount of performance achieved per unit of energy dissipated while meeting minimum performance levels. To increase this efficiency we propose utilizing static strands, dependence chains without fan-out which are exposed by a compiler pass. These dependent instructions are resequenced to be sequential and annotated to communicate their location to the hardware. Importantly, this modified application is binary compatible and functionally identical to the original, allowing transparent execution on a baseline processor. However, these static strands can be easily collapsed and optimized by simple processor modifications, significantly reducing the workload energy. Results show that over 30% of MediaBench and Spec2000int dynamic instructions can be collapsed, reducing issue logic energy by 16 to 24%, bypass energy 17 to 20%, and register file energy 13 to 14%. Additionally, by increasing the effective capactity of pipeline resources by almost a third, average IPC can be improved up to 15%. This performance gain can then be traded in for a lower clock frequency to maintain a basline level of performance, reducing energy further.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
A. Bik, M. Girkar, P. Grey, and X. Tian. Efficient exploitation of parallelism on Pentium III and Pentium 4 processor-based systems. In Intel Technology Journal, Q1 2001.
|
| |
2
|
|
| |
3
|
D. Brash. The ARM architecture version 6 (ARMv6). White paper, ARM, 2002.
|
| |
4
|
D. Burger and T. Austin. The Simplescalar tool set, version 2.0. Technical Report 1342, Dept of Computer Science, University of Wisconsin-Madison, 1997.
|
| |
5
|
|
| |
6
|
Y. Cao, T. Sato, D. Sylvester, M. Orshansky, and C. Hu. New paradigm of predictive mosfet and interconnect modeling for early circuit design. In Proceedings of IEEE Custom Integrated Circuits Conference, June 2000.
|
| |
7
|
Nathan Clark , Manjunath Kudlur , Hyunchul Park , Scott Mahlke , Krisztian Flautner, Application-Specific Processing on a General-Purpose Core via Transparent Instruction Set Customization, Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture, p.30-40, December 04-08, 2004, Portland, Oregon
[doi> 10.1109/MICRO.2004.5]
|
| |
8
|
|
 |
9
|
|
| |
10
|
S. Gochman, R. Ronen, I. Anati, A. Berkovits, T. Kurts, A. Naveh, A. Saeed, Z. Sperber, and R. Valentine. The Intel Pentium M processor: Microarchitecture and performance. Intel Technology Journal, 7(2), May 2003.
|
| |
11
|
Wen-Mei W. Hwu , Scott A. Mahlke , William Y. Chen , Pohua P. Chang , Nancy J. Warter , Roger A. Bringmann , Roland G. Ouellette , Richard E. Hank , Tokuzo Kiyohara , Grant E. Haab , John G. Holm , Daniel M. Lavery, The superblock: an effective technique for VLIW and superscalar compilation, The Journal of Supercomputing, v.7 n.1-2, p.229-248, May 1993
[doi> 10.1007/BF01205185]
|
| |
12
|
IBM Corporation. PowerPC 750 RISC Microprocessor Technical Summary. www.ibm.com.
|
| |
13
|
|
 |
14
|
|
| |
15
|
|
 |
16
|
|
| |
17
|
|
| |
18
|
Chunho Lee , Miodrag Potkonjak , William H. Mangione-Smith, MediaBench: a tool for evaluating and synthesizing multimedia and communicatons systems, Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture, p.330-335, December 01-03, 1997, Research Triangle Park, North Carolina, United States
|
| |
19
|
M. Mamidipaka and N. Dutt. eCACTI: An enhanced power estimation model for on-chip caches. Technical Report 04-28, Center for Embedded Computer Systems, UC Irvine, 2004.
|
| |
20
|
A. Marquez, K. Theobald, X. Tang, and G. Gao. A superstrand architecture. Technical Memo 14, University of Delaware, Computer Architecture and Parallel Systems Laboratory, 1997.
|
| |
21
|
|
 |
22
|
Subbarao Palacharla , Norman P. Jouppi , J. E. Smith, Complexity-effective superscalar processors, Proceedings of the 24th annual international symposium on Computer architecture, p.206-218, June 01-04, 1997, Denver, Colorado, United States
|
| |
23
|
|
 |
24
|
|
| |
25
|
Renesas Technology. SH-4A Software Manual. www.renesas.com.
|
| |
26
|
|
| |
27
|
UC Berkeley. Berkeley predictive technology model. www-device.eecs.berkeley.edu/~ptm.
|
 |
28
|
|
CITED BY 5
|
|
Nathan Clark , Amir Hormati , Scott Mahlke , Sami Yehia, Scalable subgraph mapping for acyclic computation accelerators, Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems, October 22-25, 2006, Seoul, Korea
|
|
|
|
|
|
|
|
|
|
|
|
Ganesh Dasika , Shidhartha Das , Kevin Fan , Scott Mahlke , David Bull, DVFS in loop accelerators using BLADES, Proceedings of the 45th annual conference on Design automation, June 08-13, 2008, Anaheim, California
|
|