|
ABSTRACT
Pipeline flushes are becoming increasingly expensive in modern microprocessors with large instruction windows and deep pipelines. Selective re-execution is a technique that can reduce the penalty of mis-speculations by re-executing only instructions affected by the mis-speculation, instead of all instructions. In this paper we introduce a new selective re-execution mechanism that exploits the properties of a dataflow-like Explicit Data Graph Execution (EDGE) architecture to support efficient mis-speculation recovery, while scaling to window sizes of thousands of instructions with high performance. This distributed selective re-execution (DSRE) protocol permits multiple speculative waves of computation to be traversing a dataflow graph simultaneously, with a commit wave propagating behind them to ensure correct execution. We evaluate one application of this protocol to provide efficient recovery for load-store dependence speculation. Unlike traditional dataflow architectures which resorted to single-assignment memory semantics, the DSRE protocol combines dataflow execution with speculation to enable high performance and conventional sequential memory semantics. Our experiments show that the DSRE protocol results in an average 17% speedup over the best dependence predictor proposed to date, and obtains 82% of the performance possible with a perfect oracle directing the issue of loads.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
Vikas Agarwal , M. S. Hrishikesh , Stephen W. Keckler , Doug Burger, Clock rate versus IPC: the end of the road for conventional microarchitectures, Proceedings of the 27th annual international symposium on Computer architecture, p.248-259, June 2000, Vancouver, British Columbia, Canada
|
| |
2
|
|
| |
3
|
Doug Burger , Stephen W. Keckler , Kathryn S. McKinley , Mike Dahlin , Lizy K. John , Calvin Lin , Charles R. Moore , James Burrill , Robert G. McDonald , William Yoder , the TRIPS Team, Scaling to the End of Silicon with EDGE Architectures, Computer, v.37 n.7, p.44-55, July 2004
[doi> 10.1109/MC.2004.65]
|
| |
4
|
B. Calder and G. Reinman. A comparative survey of load speculation architectures. Journal of Instruction-Level Parallelism, 2, May 2000.
|
| |
5
|
|
 |
6
|
|
 |
7
|
|
| |
8
|
D. Ernst and T. Austin. Practical Selective Replay for Reduced-Tag Schedulers. In Proceedings of the 2nd Annual Workshop on Duplicating, Deconstructing, and Debunking (WDDD-2), pages 58--63, June 2003.
|
| |
9
|
Dan Ernst , Nam Sung Kim , Shidhartha Das , Sanjay Pant , Rajeev Rao , Toan Pham , Conrad Ziesler , David Blaauw , Todd Austin , Krisztian Flautner , Trevor Mudge, Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation, Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture, p.7, December 03-05, 2003
|
 |
10
|
David M. Gallagher , William Y. Chen , Scott A. Mahlke , John C. Gyllenhaal , Wen-mei W. Hwu, Dynamic memory disambiguation using the memory conflict buffer, Proceedings of the sixth international conference on Architectural support for programming languages and operating systems, p.183-193, October 05-07, 1994, San Jose, California, United States
|
| |
11
|
|
| |
12
|
G. Hinton, D. Sager, M. Upton, D. Boggs, D. Carmean, A. Kyker, and P. Roussel. The microarchitecture of the Pentium 4 processor. Intel Technology Journal Q1, 2001.
|
 |
13
|
Jaehyuk Huh , Jichuan Chang , Doug Burger , Gurindar S. Sohi, Coherence decoupling: making use of incoherence, Proceedings of the 11th international conference on Architectural support for programming languages and operating systems, October 07-13, 2004, Boston, MA, USA
|
| |
14
|
J. B. Keller, R. W. Haddad, and S. G. Meier. Scheduler which discovers non-speculative nature of an instruction after issuing and reissues the instruction. United States Patent 6,564,315, May 2003.
|
| |
15
|
|
| |
16
|
|
| |
17
|
|
| |
18
|
A. A. Merchant, D. J. Sager, and D. D. Boggs. Computer processor with a replay system. United States Patent 6,163,838, December 2000.
|
| |
19
|
A. A. Merchant, D. J. Sager, D. D. Boggs, and M. D. Upton. Computer processor with a replay system having a plurality of checkers. United States Patent 6,094,717, July 2000.
|
 |
20
|
Andreas Moshovos , Scott E. Breach , T. N. Vijaykumar , Gurindar S. Sohi, Dynamic speculation and synchronization of data dependences, Proceedings of the 24th annual international symposium on Computer architecture, p.181-193, June 01-04, 1997, Denver, Colorado, United States
|
| |
21
|
|
| |
22
|
R. Panwar and R. C. Hetherington. Appartus for executing coded dependent instructions having variable latencies. United States Patent 5,987,594, November 1999.
|
 |
23
|
|
| |
24
|
N. Ranganathan, R. Nagarajan, D. Burger, and S. W. Keckler. Combining hyperblocks and exit prediction to increase front-end bandwidth and performance. Technical Report TR-02-41, Department of Computer Sciences, The University of Texas at Austin, Austin, TX, September 2002.
|
| |
25
|
Eric Rotenberg , Quinn Jacobson , Yiannakis Sazeides , Jim Smith, Trace processors, Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture, p.138-148, December 01-03, 1997, Research Triangle Park, North Carolina, United States
|
| |
26
|
|
 |
27
|
Karthikeyan Sankaralingam , Ramadass Nagarajan , Haiming Liu , Changkyu Kim , Jaehyuk Huh , Doug Burger , Stephen W. Keckler , Charles R. Moore, Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture, Proceedings of the 30th annual international symposium on Computer architecture, June 09-11, 2003, San Diego, California
|
| |
28
|
Jared Stark , Paul Racunas , Yale N. Patt, Reducing the performance impact of instruction cache misses by writing instructions into the reservation stations out-of-order, Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture, p.34-43, December 01-03, 1997, Research Triangle Park, North Carolina, United States
|
| |
29
|
|
| |
30
|
Trimaran: An infrastructure for research in instruction-level parallelism. http://www.trimaran.org.
|
 |
31
|
Adi Yoaz , Mattan Erez , Ronny Ronen , Stephan Jourdan, Speculation techniques for improving load related instruction scheduling, Proceedings of the 26th annual international symposium on Computer architecture, p.42-53, May 01-04, 1999, Atlanta, Georgia, United States
|
| |
32
|
H. Zhou, C. ying Fu, E. Rotenberg, and T. Conte. A study of value speculative execution and misspeculation recovery in superscalar microprocessors. Technical report, ECE Department, N. C. State University, January 2000.
|
CITED BY
|
|
Smruti R. Sarangi , Wei Liu, Josep Torrellas , Yuanyuan Zhou, ReSlice: Selective Re-Execution of Long-Retired Misspeculated Instructions Using Forward Slicing, Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture, p.257-270, November 12-16, 2005, Barcelona, Spain
|
REVIEW
"Arvid G. Larson : Reviewer"
Pipeline architectures with expansive microarchitectural structures, using increasingly large instruction windows and look-ahead depths, are often limited in throughput efficiency. This limitation is primarily due to frequent misspeculation of ins
more...
|