ABSTRACT
In the pursuit of instruction-level parallelism, significant demands are placed on a processor's instruction delivery mechanism. Delivering the performance necessary to meet future processor execution targets requires that the performance of the instruction delivery mechanism scale with the execution core. Attaining these targets is a challenging task due to I-cache misses, branch mispredictions, and taken branches in the instruction stream. To complicate matters, a VLSI interconnect scaling trend is materializing that further limits the performance of front-end designs in future-generation process technologies.

To counter these challenges, we present a fetch architecture that permits a faster cycle time than previous designs and scales better with future process technologies. Our design, called the Fetch Target Buffer (FTB), is a multi-level, fetch-block-oriented predictor. We decouple the FTB from the instruction fetch and decode pipelines to afford it the fastest clock possible. Through cycle-based simulation and circuit-level delay analysis, we find that our multi-level FTB design is capable of delivering instructions 25% faster than the best single-level BTB-based pipeline configuration. Moreover, we show that our design scales better to future process technologies than traditional single-level designs.