ABSTRACT
In an effort to push the envelope of system performance, microprocessor designs are continually exploiting higher levels of instruction-level parallelism, resulting in increasing bandwidth demands on the address translation mechanism. Most current microprocessor designs meet this demand with a multi-ported TLB. While this design provides an excellent hit rate at each port, its access latency and area grow very quickly as the number of ports is increased. As bandwidth demands continue to increase, multi-ported designs will soon impact memory access latency.

We present four high-bandwidth address translation mechanisms with latency and area characteristics that scale better than a multi-ported TLB design. We extend traditional high-bandwidth memory design techniques to address translation, developing interleaved and multi-level TLB designs. In addition, we introduce two new designs crafted specifically for high-bandwidth address translation. Piggyback ports are introduced as a technique to exploit spatial locality in simultaneous translation requests, allowing accesses to the same virtual memory page to combine their requests at the TLB access port. Pretranslation is introduced as a technique for attaching translations to base register values, making it possible to reuse a single translation many times.

We perform extensive simulation-based studies to evaluate our designs. We vary key system parameters, such as processor model, page size, and number of architected registers, to see what effects these changes have on the relative merits of each approach. A number of designs show particular promise. Multi-level TLBs with as few as eight entries in the upper-level TLB nearly achieve the performance of a TLB with unlimited bandwidth. Piggyback ports combined with a lesser-ported TLB structure, e.g., an interleaved or multi-ported TLB, also perform well. Pretranslation over a single-ported TLB performs almost as well as a same-sized multi-level TLB with the added benefit of decreased access latency for physically indexed caches.
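
The piggyback-port idea described in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware design evaluated in the paper: the function name `translate_cycle`, the 4 KB page size, the dictionary standing in for the TLB, and the first-come port arbitration are all hypothetical choices made for illustration. It shows how simultaneous requests to the same virtual page can share one TLB port access, while requests to other pages must wait once the ports are exhausted.

```python
PAGE_SIZE = 4096  # assumed page size (4 KB); the abstract does not fix one


def translate_cycle(vaddrs, tlb, ports=1, page_size=PAGE_SIZE):
    """Model one cycle of translation with piggybacking (illustrative only).

    vaddrs: virtual addresses requested simultaneously in this cycle.
    tlb:    dict mapping virtual page number -> physical frame number.
    ports:  number of real TLB access ports available this cycle.

    Returns (results, deferred). results maps a request index to its
    physical address, or to None on a TLB miss. deferred lists the
    indices of requests that found no port and must retry later.
    """
    results = {}
    deferred = []
    granted = {}  # pages that won a port this cycle: vpn -> frame or None

    for i, va in enumerate(vaddrs):
        vpn, offset = divmod(va, page_size)
        if vpn not in granted:
            if len(granted) >= ports:
                # All ports are taken by other pages: no piggyback possible.
                deferred.append(i)
                continue
            # This request wins a port and performs the actual TLB lookup.
            granted[vpn] = tlb.get(vpn)
        # Either the port winner or a piggybacking request to the same page.
        frame = granted[vpn]
        results[i] = None if frame is None else frame * page_size + offset

    return results, deferred


if __name__ == "__main__":
    tlb = {0x10: 0x80, 0x11: 0x90}
    requests = [0x10004, 0x10FF8, 0x11010, 0x10020]
    results, deferred = translate_cycle(requests, tlb, ports=1)
    for i, pa in sorted(results.items()):
        print(i, hex(pa) if pa is not None else "miss")
    print("deferred:", deferred)
```

In this model, three of the four requests fall on the same page and are served by a single port access; only the request to the other page is deferred. Real hardware would compare page numbers in parallel rather than in a loop, so the sketch says nothing about latency or area.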