ABSTRACT
Projections of computer technology forecast processors with peak performance of 1,000 MIPS in the relatively near future. These processors could easily lose half or more of their performance in the memory hierarchy if the hierarchy design is based on conventional caching techniques. This paper presents hardware techniques to improve the performance of caches.
Miss caching places a small fully-associative cache between a cache and its refill path. Misses in the cache that hit in the miss cache have only a one-cycle miss penalty, as opposed to a many-cycle miss penalty without the miss cache. Small miss caches of 2 to 5 entries are shown to be very effective in removing mapping conflict misses in first-level direct-mapped caches.
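As an illustration only, the following Python sketch models a direct-mapped cache backed by a small fully-associative miss cache. The function name, the parameters (num_sets, line_size, miss_entries), and the LRU replacement policy are assumptions made for this example, not details taken from the paper, and the model counts events rather than cycles.

```python
from collections import OrderedDict

def simulate_miss_cache(addresses, num_sets=4, line_size=16, miss_entries=2):
    """Return (l1_hits, miss_cache_hits, full_misses) for a byte-address trace.

    Hypothetical model: on a full miss the requested line is loaded into
    both the direct-mapped cache and the miss cache (assumed LRU).
    """
    l1 = [None] * num_sets        # direct-mapped: one line number per set
    miss_cache = OrderedDict()    # fully associative, oldest entry first
    l1_hits = mc_hits = full_misses = 0
    for addr in addresses:
        line = addr // line_size
        index = line % num_sets
        if l1[index] == line:
            l1_hits += 1
            continue
        if line in miss_cache:
            # Short-penalty case: the line is reloaded from the miss cache.
            mc_hits += 1
            miss_cache.move_to_end(line)
        else:
            # Long-penalty case: the line comes from the next level.
            full_misses += 1
            miss_cache[line] = True
            if len(miss_cache) > miss_entries:
                miss_cache.popitem(last=False)   # evict least recently used
        l1[index] = line
    return l1_hits, mc_hits, full_misses

# Two lines that map to the same set, accessed alternately.
trace = [0, 64, 0, 64, 0, 64]
print(simulate_miss_cache(trace, miss_entries=2))   # (0, 4, 2)
print(simulate_miss_cache(trace, miss_entries=0))   # (0, 0, 6)
```

In this toy trace, a two-entry miss cache turns four of the six conflict misses into short-penalty hits; without it, every access is a full miss.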
Victim caching is an improvement to miss caching that loads the small fully-associative cache with the victim of a miss and not the requested line. Small victim caches of 1 to 5 entries are even more effective at removing conflict misses than miss caching.
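A minimal Python sketch of the victim-cache idea follows. The function name, parameters, and the swap-on-hit and LRU eviction details are assumptions made for illustration and are not taken from the paper. The small buffer here receives the line evicted from the direct-mapped cache rather than the requested line.

```python
from collections import OrderedDict

def simulate_victim_cache(addresses, num_sets=4, line_size=16, victim_entries=1):
    """Return (l1_hits, victim_hits, full_misses) for a byte-address trace.

    Hypothetical model: the line displaced from the direct-mapped cache
    (the victim) is kept in a small fully-associative buffer.
    """
    l1 = [None] * num_sets      # direct-mapped: one line number per set
    victims = OrderedDict()     # fully associative, oldest entry first
    l1_hits = vc_hits = full_misses = 0
    for addr in addresses:
        line = addr // line_size
        index = line % num_sets
        if l1[index] == line:
            l1_hits += 1
            continue
        if line in victims:
            # Hit in the victim cache: the line moves back into L1 (swap).
            vc_hits += 1
            del victims[line]
        else:
            full_misses += 1
        evicted = l1[index]
        l1[index] = line
        if evicted is not None and victim_entries > 0:
            victims[evicted] = True
            if len(victims) > victim_entries:
                victims.popitem(last=False)      # evict oldest victim
    return l1_hits, vc_hits, full_misses

# Two lines that map to the same set, accessed alternately.
trace = [0, 64, 0, 64, 0, 64]
print(simulate_victim_cache(trace, victim_entries=1))   # (0, 4, 2)
```

Because the buffer never duplicates a line already held in the direct-mapped cache, a single entry is enough to absorb the alternating conflict in this toy trace, which is consistent with the abstract's observation that victim caches are useful from one entry upward.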
Stream buffers prefetch cache lines starting at a cache miss address. The prefetched data is placed in the buffer and not in the cache. Stream buffers are useful in removing capacity and compulsory cache misses, as well as some instruction cache conflict misses. Stream buffers are more effective than previously investigated prefetch techniques at using the next slower level in the memory hierarchy when it is pipelined. An extension to the basic stream buffer, called multi-way stream buffers, is introduced. Multi-way stream buffers are useful for prefetching along multiple intertwined data reference streams.
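The sketch below gives a rough Python model of a single stream buffer. The function name, the parameters, the head-only comparison, and the assumption that prefetched lines arrive instantly are simplifications chosen for this example rather than details confirmed by the abstract. Prefetched lines wait in a FIFO outside the cache and are moved in only when the processor misses on them.

```python
from collections import deque

def simulate_stream_buffer(addresses, num_sets=4, line_size=16, depth=4):
    """Return (l1_hits, buffer_hits, full_misses) for a byte-address trace.

    Hypothetical model: a cache miss that also misses the buffer head
    flushes the buffer and restarts prefetching at the following line.
    Prefetch latency is ignored.
    """
    l1 = [None] * num_sets      # direct-mapped: one line number per set
    buffer = deque()            # FIFO of prefetched line numbers
    l1_hits = sb_hits = full_misses = 0
    for addr in addresses:
        line = addr // line_size
        index = line % num_sets
        if l1[index] == line:
            l1_hits += 1
            continue
        if buffer and buffer[0] == line:
            # Miss satisfied from the buffer head; keep the stream going.
            sb_hits += 1
            buffer.popleft()
            buffer.append(buffer[-1] + 1 if buffer else line + 1)
        else:
            # Full miss: restart the stream just past the missing line.
            full_misses += 1
            buffer = deque(range(line + 1, line + 1 + depth))
        l1[index] = line
    return l1_hits, sb_hits, full_misses

# Sequential sweep over eight consecutive lines.
trace = list(range(0, 16 * 8, 16))
print(simulate_stream_buffer(trace, depth=4))   # (0, 7, 1)
print(simulate_stream_buffer(trace, depth=0))   # (0, 0, 8)
```

In this sequential trace only the first access goes to the next memory level; the remaining compulsory misses are served from the buffer. A multi-way variant could be sketched by keeping several such FIFOs and restarting the least recently used one on a full miss, though that is likewise an assumption about the design.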
Together, victim caches and stream buffers reduce the miss rate of the first level in the cache hierarchy by a factor of two to three on a set of six large benchmarks.
CITED BY 303
Jared Stark, Paul Racunas, Yale N. Patt, Reducing the performance impact of instruction cache misses by writing instructions into the reservation stations out-of-order, Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture, p.34-43, December 01-03, 1997, Research Triangle Park, North Carolina, United States
Rajeev Balasubramonian, David Albonesi, Alper Buyuktosunoglu, Sandhya Dwarkadas, Memory hierarchy reconfiguration for energy and performance in general-purpose processor architectures, Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture, p.245-257, December 2000, Monterey, California, United States
Jude A. Rivers, Edward S. Tam, Gary S. Tyson, Edward S. Davidson, Matt Farrens, Utilizing reuse information in data cache management, Proceedings of the 12th international conference on Supercomputing, p.449-456, July 1998, Melbourne, Australia
Mikko H. Lipasti, William J. Schmidt, Steven R. Kunkel, Robert R. Roediger, SPAID: software prefetching in pointer- and call-intensive environments, Proceedings of the 28th annual international symposium on Microarchitecture, p.231-236, November 29-December 01, 1995, Ann Arbor, Michigan, United States
Nigel Topham, Antonio González, José González, The design and performance of a conflict-avoiding cache, Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture, p.71-80, December 01-03, 1997, Research Triangle Park, North Carolina, United States
Lixin Zhang, Zhen Fang, Mike Parker, Binu K. Mathew, Lambert Schaelicke, John B. Carter, Wilson C. Hsieh, Sally A. McKee, The Impulse Memory Controller, IEEE Transactions on Computers, v.50 n.11, p.1117-1132, November 2001
K. Nakazawa, H. Nakamura, H. Imori, S. Kawabe, Pseudo vector processor based on register-windowed superscalar pipeline, Proceedings of the 1992 ACM/IEEE conference on Supercomputing, p.642-651, November 16-20, 1992, Minneapolis, Minnesota, United States
P. R. Panda, F. Catthoor, N. D. Dutt, K. Danckaert, E. Brockmeyer, C. Kulkarni, A. Vandercappelle, P. G. Kjeldsberg, Data and memory optimization techniques for embedded systems, ACM Transactions on Design Automation of Electronic Systems (TODAES), v.6 n.2, p.149-206, April 2001
Tim Stanley, Michael Upton, Patrick Sherhart, Trevor Mudge, Richard Brown, A microarchitectural performance evaluation of a 3.2 Gbyte/s microprocessor bus, Proceedings of the 26th annual international symposium on Microarchitecture, p.31-40, December 01-03, 1993, Austin, Texas, United States
Antonio González, Mateo Valero, Nigel Topham, Joan M. Parcerisa, Eliminating cache conflict misses through XOR-based placement functions, Proceedings of the 11th international conference on Supercomputing, p.76-83, July 07-11, 1997, Vienna, Austria
Sally A. McKee, Assaji Aluwihare, Benjamin H. Clark, Robert H. Klenke, Trevor C. Landon, Christopher W. Oliver, Maximo H. Salinas, Adam E. Szymkowiak, Kenneth L. Wright, Wm. A. Wulf, James H. Aylor, Design and evaluation of dynamic access ordering hardware, Proceedings of the 10th international conference on Supercomputing, p.125-132, May 25-28, 1996, Philadelphia, Pennsylvania, United States
Hiroshi Nakamura, Taisuke Boku, Hideo Wada, Hiromitsu Imori, Ikuo Nakata, Yasuhiro Inagami, Kisaburo Nakazawa, Yoshiyuki Yamashita, A scalar architecture for pseudo vector processing based on slide-windowed registers, Proceedings of the 7th international conference on Supercomputing, p.298-307, July 19-23, 1993, Tokyo, Japan
Ken Mai, Tim Paaske, Nuwan Jayasena, Ron Ho, William J. Dally, Mark Horowitz, Smart Memories: a modular reconfigurable architecture, ACM SIGARCH Computer Architecture News, v.28 n.2, p.161-171, May 2000
Henk Muller, Dan Page, James Irwin, David May, Caches with compositional performance, Embedded processor design challenges: systems, architectures, modeling, and simulation-SAMOS, Springer-Verlag New York, Inc., New York, NY, 2002
R. Iris Bahar, Gianluca Albera, Srilatha Manne, Power and performance tradeoffs using various caching strategies, Proceedings of the 1998 international symposium on Low power electronics and design, p.64-69, August 10-12, 1998, Monterey, California, United States
Mark Brehob, Richard Enbody, Eric Torng, Stephen Wagner, On-line restricted caching, Proceedings of the twelfth annual ACM-SIAM symposium on Discrete algorithms, p.374-383, January 07-09, 2001, Washington, D.C., United States
Gary Tyson, Matthew Farrens, John Matthews, Andrew R. Pleszkun, A modified approach to data cache management, Proceedings of the 28th annual international symposium on Microarchitecture, p.93-103, November 29-December 01, 1995, Ann Arbor, Michigan, United States
Teresa L. Johnson, Matthew C. Merten, Wen-Mei W. Hwu, Run-time spatial locality detection and optimization, Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture, p.57-64, December 01-03, 1997, Research Triangle Park, North Carolina, United States
Sorin Iacobovici, Lawrence Spracklen, Sudarshan Kadambi, Yuan Chou, Santosh G. Abraham, Effective stream-based and execution-based data prefetching, Proceedings of the 18th annual international conference on Supercomputing, June 26-July 01, 2004, Malo, France
Sally A. McKee, William A. Wulf, James H. Aylor, Maximo H. Salinas, Robert H. Klenke, Sung I. Hong, Dee A. B. Weikle, Dynamic Access Ordering for Streamed Computations, IEEE Transactions on Computers, v.49 n.11, p.1255-1271, November 2000
Chi-Keung Luk, Robert Muth, Harish Patil, Richard Weiss, P. Geoffrey Lowney, Robert Cohn, Profile-guided post-link stride prefetching, Proceedings of the 16th international conference on Supercomputing, June 22-26, 2002, New York, New York, USA
Murali Jayapala, Francisco Barat, Tom Vander Aa, Francky Catthoor, Henk Corporaal, Geert Deconinck, Clustered Loop Buffer Organization for Low Energy VLIW Embedded Processors, IEEE Transactions on Computers, v.54 n.6, p.672-683, June 2005
Jaehyuk Huh, Changkyu Kim, Hazim Shafi, Lixin Zhang, Doug Burger, Stephen W. Keckler, A NUCA substrate for flexible CMP cache sharing, Proceedings of the 19th annual international conference on Supercomputing, June 20-22, 2005, Cambridge, Massachusetts
Chanik Park, Jaeyu Seo, Sunghwan Bae, Hyojun Kim, Shinhan Kim, Bumsoo Kim, A low-cost memory architecture with NAND XIP for mobile embedded systems, Proceedings of the 1st IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis, October 01-03, 2003, Newport Beach, CA, USA
Minas Dasygenis, Erik Brockmeyer, Bart Durinck, Francky Catthoor, Dimitrios Soudris, Antonios Thanailakis, A combined DMA and application-specific prefetching approach for tackling the memory latency bottleneck, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, v.14 n.3, p.279-291, March 2006
John B. Carter, Wilson C. Hsieh, Leigh B. Stoller, Mark Swanson, Lixin Zhang, Sally A. McKee, Impulse: Memory system support for scientific applications, Scientific Programming, v.7 n.3-4, p.195-209, August 1999
Christopher Batten, Ronny Krashinsky, Steve Gerding, Krste Asanovic, Cache Refill/Access Decoupling for Vector Machines, Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture, p.331-342, December 04-08, 2004, Portland, Oregon
Luis M. Ramos, José Luis Briz, Pablo E. Ibáñez, Victor Viñals, Data prefetching in a cache hierarchy with high bandwidth and capacity, Proceedings of the 2006 workshop on MEmory performance: DEaling with Applications, systems and architectures, p.37-44, September 16-20, 2006, Seattle, Washington
Tanausú Ramírez, Alex Pajuelo, Oliverio J. Santana, Mateo Valero, Kilo-instruction processors, runahead and prefetching, Proceedings of the 3rd conference on Computing frontiers, May 03-05, 2006, Ischia, Italy
Zhen Yang, Xudong Shi, Feiqi Su, Jih-Kwon Peir, Overlapping dependent loads with addressless preload, Proceedings of the 15th international conference on Parallel architectures and compilation techniques, September 16-20, 2006, Seattle, Washington, USA
Jan-Willem van de Waerdt, Stamatis Vassiliadis, Sanjeev Das, Sebastian Mirolo, Chris Yen, Bill Zhong, Carlos Basto, Jean-Paul van Itegem, Dinesh Amirtharaj, Kulbhushan Kalra, Pedro Rodriguez, Hans van Antwerpen, The TM3270 Media-Processor, Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture, p.331-342, November 12-16, 2005, Barcelona, Spain
Tom Vander Aa, Murali Jayapala, Francisco Barat, Geert Deconinck, Rudy Lauwereins, Henk Corporaal, Francky Catthoor, Instruction buffering exploration for low energy embedded processors, Journal of Embedded Computing, v.1 n.3, p.341-351, August 2005
Tong Chen, Tao Zhang, Zehra Sura, Mar Gonzales Tallada, Prefetching irregular references for software cache on cell, Proceedings of the sixth annual IEEE/ACM international symposium on Code generation and optimization, April 05-09, 2008, Boston, MA, USA
Akihiro Yamamoto, Yusuke Tanaka, Hideki Ando, Toshio Shimada, Data prefetching and address pre-calculation through instruction pre-execution with two-step physical register deallocation, Proceedings of the 2007 workshop on MEmory performance: DEaling with Applications, systems and architecture, p.33-40, September 16-16, 2007, Brasov, Romania
Jose Baiocchi, Bruce R. Childers, Jack W. Davidson, Jason D. Hiser, Jonathan Misurda, Fragment cache management for dynamic binary translators in embedded systems with scratchpad, Proceedings of the 2007 international conference on Compilers, architecture, and synthesis for embedded systems, September 30-October 03, 2007, Salzburg, Austria
Fang Liu, Fei Guo, Yan Solihin, Seongbeom Kim, Abdulaziz Eker, Characterizing and modeling the behavior of context switch misses, Proceedings of the 17th international conference on Parallel architectures and compilation techniques, October 25-29, 2008, Toronto, Ontario, Canada
Lingxiang Xiang, Tianzhou Chen, Qingsong Shi, Wei Hu, Less reused filter: improving L2 cache performance via filtering less reused lines, Proceedings of the 23rd international conference on Supercomputing, June 08-12, 2009, Yorktown Heights, NY, USA
Abhishek Das, Berkin Ozisikyilmaz, Serkan Ozdemir, Gokhan Memik, Joseph Zambreno, Alok Choudhary, Evaluating the effects of cache redundancy on profit, Proceedings of the 2008 41st IEEE/ACM International Symposium on Microarchitecture, p.388-398, November 08-12, 2008