ABSTRACT
Previous research has shown that the SPEC benchmarks achieve low miss ratios in relatively small instruction caches. This paper presents evidence that current software-development practices produce applications that exhibit substantially higher instruction-cache miss ratios than do the SPEC benchmarks. To represent these trends, we have assembled a collection of applications, called the Instruction Benchmark Suite (IBS), that provides a better test of instruction-cache performance. We discuss the rationale behind the design of IBS and characterize its behavior relative to the SPEC benchmark suite. Our analysis is based on trace-driven and trap-driven simulations and takes into full account both the application and operating-system components of the workloads.

This paper then reexamines a collection of previously proposed hardware mechanisms for improving instruction-fetch performance in the context of the IBS workloads. We study the impact of cache organization, transfer bandwidth, prefetching, and pipelined memory systems on machines that rely on relatively small primary instruction caches to facilitate increased clock rates. We find that, although of little use for SPEC, the right combination of these techniques substantially benefits IBS. Even so, under IBS, a stubborn lower bound on the instruction-fetch CPI remains as an obstacle to improving overall processor performance.