ABSTRACT
The memory bandwidth demands of modern microprocessors require the use of a multi-ported cache to achieve peak performance. However, multi-ported caches are costly to implement. In this paper we propose techniques for improving the bandwidth of a single cache port by using additional buffering in the processor, and by taking maximum advantage of a wider cache port. We evaluate these techniques using realistic applications that include the operating system. Our techniques using a single-ported cache achieve 91% of the performance of a dual-ported cache.
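As a rough illustration of why a wider cache port combined with buffering of pending loads can reduce port traffic, the toy model below counts port accesses for a stream of load addresses. It is a sketch under stated assumptions, not the mechanism evaluated in the paper: the function name `port_accesses`, the `window` of buffered loads, and the example address stream are all hypothetical.

```python
def port_accesses(addresses, port_bytes, window):
    """Count single-port cache accesses for a stream of load addresses.

    Toy assumption: up to `window` loads are buffered at a time, and all
    buffered loads that fall in the same aligned `port_bytes` block are
    satisfied by one access through the port.
    """
    accesses = 0
    for start in range(0, len(addresses), window):
        pending = addresses[start:start + window]
        # Each distinct aligned block touched by the buffered loads costs one access.
        blocks = {addr // port_bytes for addr in pending}
        accesses += len(blocks)
    return accesses


# Hypothetical byte addresses of eight 8-byte loads.
loads = [0, 8, 16, 24, 64, 72, 128, 136]

narrow = port_accesses(loads, port_bytes=8, window=4)   # 8-byte port
wide = port_accesses(loads, port_bytes=32, window=4)    # 32-byte port

print(narrow, wide)
```

In this model the benefit of the wider port depends entirely on spatial locality among the buffered loads; a stream with no two loads in the same aligned block would see no reduction.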