ABSTRACT
Pre-execution removes the microarchitectural latency of "problem" loads from a program's critical path by redundantly executing copies of their computations in parallel with the main program. Several pre-execution systems have been proposed, along with a quantitative framework (PTHSEL) for analytical pre-execution thread (p-thread) selection and even a research prototype. To date, however, the energy aspects of pre-execution have not been studied. Cycle-level performance and energy simulations on SPEC2000 integer benchmarks that suffer from L2 misses show that energy-blind pre-execution naturally has a linear latency/energy trade-off, improving performance by 13.8% while increasing energy consumption by 11.9%. To improve this trade-off, we propose two extensions to PTHSEL. First, we replace the flat cycle-for-cycle load cost model with a model based on critical-path estimation. This extension increases p-thread efficiency in an energy-independent way. Second, we add a parameterized energy model to PTHSEL (forming PTHSEL+E) that allows it to actively select p-threads that reduce energy rather than (or in combination with) execution latency. Experiments show that PTHSEL+E manipulates pre-execution's latency/energy trade-off more effectively. Latency-targeted selection benefits from the improved load cost model: its performance improvements grow to an average of 16.4% while energy costs drop to 8.7%. ED-targeted selection produces p-threads that improve performance by only 12.9% but improve ED (the energy-delay product) by 8.8%. Targeting p-thread selection for energy reduction results in "energy-free" pre-execution, with an average speedup of 5.4% and a small decrease in total energy consumption (0.7%).
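The latency, energy, and ED figures quoted above are related through the energy-delay product, ED = E × D. The following is a minimal sketch of that arithmetic, not the paper's PTHSEL+E model. The function names are hypothetical. It also assumes that each percentage is a fractional change relative to the baseline, which the abstract does not state explicitly. In particular, "performance improvement" may mean either a reduction in execution time or a speedup, so a conversion helper is included for the second reading.

```python
def latency_change_from_speedup(speedup: float) -> float:
    """Convert a fractional speedup (e.g. 0.10 for 10% faster) into a
    fractional change in execution time (negative means shorter)."""
    return 1.0 / (1.0 + speedup) - 1.0


def relative_ed(latency_change: float, energy_change: float) -> float:
    """Fractional change in the energy-delay product, ED = E * D.

    Both arguments are fractional changes relative to the baseline;
    negative values are reductions (e.g. -0.10 is 10% less).
    """
    delay = 1.0 + latency_change
    energy = 1.0 + energy_change
    return delay * energy - 1.0


def implied_energy_change(latency_change: float, ed_change: float) -> float:
    """Solve ED = E * D for the fractional energy change, given the
    fractional changes in latency and in ED."""
    return (1.0 + ed_change) / (1.0 + latency_change) - 1.0


if __name__ == "__main__":
    # Assumption: read "improve performance by 12.9%" as a 12.9% cut in
    # execution time and "ED by 8.8%" as an 8.8% cut in ED.
    e = implied_energy_change(-0.129, -0.088)
    print(f"implied energy change: {e:+.1%}")
```

Under the speedup reading, pass `latency_change_from_speedup(0.129)` as the first argument instead. The two readings give different implied energy changes, so any figure derived this way is only illustrative.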