Optimization of instruction fetch mechanisms for high issue rates
Source
International Symposium on Computer Architecture
Proceedings of the 22nd annual international symposium on Computer architecture
S. Margherita Ligure, Italy
Pages: 333 - 344
Year of Publication: 1995
ISBN:0-89791-698-0
Authors
Thomas M. Conte, Computer Architecture Research Laboratory, Department of Electrical and Computer Engineering, University of South Carolina, Columbia, South Carolina
Kishore N. Menezes, Computer Architecture Research Laboratory, Department of Electrical and Computer Engineering, University of South Carolina, Columbia, South Carolina
Patrick M. Mills, Computer Architecture Research Laboratory, Department of Electrical and Computer Engineering, University of South Carolina, Columbia, South Carolina
Burzin A. Patel, Computer Architecture Research Laboratory, Department of Electrical and Computer Engineering, University of South Carolina, Columbia, South Carolina
Bibliometrics
Downloads (6 Weeks): 5, Downloads (12 Months): 39, Citation Count: 46
ABSTRACT
Recent superscalar processors issue four instructions per cycle. These processors are also powered by highly parallel superscalar cores. The potential performance of such cores can only be exploited when they are fed by high instruction bandwidth. This task is the responsibility of the instruction fetch unit. Accurate branch prediction and low I-cache miss ratios are essential for the efficient operation of the fetch unit. Several studies on cache design and branch prediction address this problem. However, these techniques are not sufficient. Even in the presence of efficient cache designs and branch prediction, the fetch unit must continuously extract multiple, non-sequential instructions from the instruction cache, realign these in the proper order, and supply them to the decoder. This paper explores solutions to this problem and presents several schemes with varying degrees of performance and cost. The most general scheme, the collapsing buffer, achieves near-perfect performance and consistently aligns instructions in excess of 90% of the time, over a wide range of issue rates. The performance boost provided by compiler optimization techniques is also investigated. Results show that compiler optimization can significantly enhance performance across all schemes. The collapsing buffer supplemented by compiler techniques remains the best-performing mechanism. The paper closes with recommendations and suggestions for future work.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
1. L. Gwennap, "MIPS R10000 uses decoupled architecture," Microprocessor Report, Oct. 1994.
2. A. Agarwal, "UltraSPARC: A new era in SPARC performance," in 1994 Microprocessor Forum Proceedings, Oct. 1994.
3. M. Slater, "AMD's K5 designed to outrun Pentium," Microprocessor Report, Oct. 1994.
11. S. McFarling, "Combining branch predictors," WRL Technical Note TN-36, Digital Equipment Corporation, 1993.
12. L. Gwennap, "PA-8000 combines complexity and speed," Microprocessor Report, Nov. 1994.
15. S. P. Song and M. Denman, "The PowerPC 604 RISC microprocessor," tech. rep., Somerset Design Center, Austin, TX, Apr. 1994.
16. M. Johnson, Superscalar microprocessor design. Englewood Cliffs, NJ: Prentice-Hall, 1991.
17. J. A. Fisher, "Trace scheduling: A technique for global microcode compaction," IEEE Trans. Comput., vol. C-30, no. 7, pp. 478-490, July 1981.
18. Wen-Mei W. Hwu, Scott A. Mahlke, William Y. Chen, Pohua P. Chang, Nancy J. Warter, Roger A. Bringmann, Roland G. Ouellette, Richard E. Hank, Tokuzo Kiyohara, Grant E. Haab, John G. Holm, Daniel M. Lavery, The superblock: an effective technique for VLIW and superscalar compilation, The Journal of Supercomputing, v.7 n.1-2, p.229-248, May 1993 [doi> 10.1007/BF01205185]
20. Mark Smotherman, Shuchi Chawla, Stan Cox, Brian Malloy, Instruction scheduling for the Motorola 88110, Proceedings of the 26th annual international symposium on Microarchitecture, p.257-262, December 01-03, 1993, Austin, Texas, United States
21. Thomas M. Conte, Burzin A. Patel, J. Stan Cox, Using branch handling hardware to support profile-driven optimization, Proceedings of the 27th annual international symposium on Microarchitecture, p.12-21, November 30-December 02, 1994, San Jose, California, United States [doi> 10.1145/192724.192726]
CITED BY 46
Quinn Jacobson, Eric Rotenberg, James E. Smith, Path-based next trace prediction, Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture, p.14-23, December 01-03, 1997, Research Triangle Park, North Carolina, United States
Dean M. Tullsen, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Jack L. Lo, Rebecca L. Stamm, Exploiting choice: instruction fetch and issue on an implementable simultaneous multithreading processor, ACM SIGARCH Computer Architecture News, v.24 n.2, p.191-202, May 1996
Glenn Reinman, Brad Calder, Dean Tullsen, Gary Tyson, Todd Austin, Classifying load and store instructions for memory renaming, Proceedings of the 13th international conference on Supercomputing, p.399-407, June 20-25, 1999, Rhodes, Greece
Eric Hao, Po-Yung Chang, Marius Evers, Yale N. Patt, Increasing the instruction fetch rate via block-structured instruction set architectures, Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture, p.191-200, December 02-04, 1996, Paris, France
Francisca Quintana, Jesus Corbal, Roger Espasa, Mateo Valero, Adding a vector unit to a superscalar processor, Proceedings of the 13th international conference on Supercomputing, p.1-10, June 20-25, 1999, Rhodes, Greece
Roni Rosner, Micha Moffie, Yiannakis Sazeides, Ronny Ronen, Selecting long atomic traces for high coverage, Proceedings of the 17th annual international conference on Supercomputing, June 23-26, 2003, San Francisco, CA, USA
Yoav Almog, Roni Rosner, Naftali Schwartz, Ari Schmorak, Specialized Dynamic Optimizations for High-Performance Energy-Efficient Microarchitecture, Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization, p.137, March 20-24, 2004, Palo Alto, California
Alex Ramírez, Josep-L. Larriba-Pey, Carlos Navarro, Josep Torrellas, Mateo Valero, Software trace cache, Proceedings of the 13th international conference on Supercomputing, p.119-126, June 20-25, 1999, Rhodes, Greece
Vladimir Stojanovic, R. Iris Bahar, Jennifer Dworak, Richard Weiss, A cost-effective implementation of an ECC-protected instruction queue for out-of-order microprocessors, Proceedings of the 43rd annual conference on Design automation, July 24-28, 2006, San Francisco, CA, USA
Juan C. Moure, Domingo Benítez, Dolores I. Rexachs, Emilio Luque, Wide and efficient trace prediction using the local trace predictor, Proceedings of the 20th annual international conference on Supercomputing, June 28-July 01, 2006, Cairns, Queensland, Australia