ABSTRACT
Growing wire delays will force substantive changes in the design of large caches. Traditional cache architectures assume that each level in the cache hierarchy has a single, uniform access time. Increases in on-chip communication delays will make the hit time of large on-chip caches a function of a line's physical location within the cache. Consequently, cache access times will become a continuum of latencies rather than a single discrete latency. This non-uniformity can be exploited to provide faster access to cache lines in the portions of the cache that reside closer to the processor. In this paper, we evaluate a series of cache designs that provide fast hits to multi-megabyte cache memories. We first propose physical designs for these Non-Uniform Cache Architectures (NUCAs). We extend these physical designs with logical policies that allow important data to migrate toward the processor within the same level of the cache. We show that, for multi-megabyte level-two caches, an adaptive, dynamic NUCA design achieves 1.5 times the IPC of a Uniform Cache Architecture of any size, outperforms the best static NUCA scheme by 11%, outperforms the best three-level hierarchy by 13% while using less silicon area, and comes within 13% of an ideal minimal hit latency solution.
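The two ideas in the abstract, a hit latency that depends on where a line sits and a policy that moves important lines closer to the processor, can be illustrated with a toy model. The sketch below is an illustration only, not the design evaluated in the paper: the class name `BankSet`, the latency constants, the promote-by-one-bank rule, and the choice to place new lines in the farthest bank are all assumptions made for the example.

```python
class BankSet:
    """Toy model of one set spread across banks, ordered nearest-first.

    All names and numbers here are hypothetical; each bank holds one line.
    """

    def __init__(self, num_banks, base_latency=3, per_hop_latency=2):
        self.banks = [None] * num_banks          # banks[0] is closest to the processor
        self.base_latency = base_latency         # assumed cycles to reach the nearest bank
        self.per_hop_latency = per_hop_latency   # assumed extra cycles per bank of distance

    def latency(self, bank_index):
        """Hit latency grows with the bank's distance from the processor."""
        return self.base_latency + self.per_hop_latency * bank_index

    def access(self, tag):
        """Return the hit latency in cycles, or None on a miss."""
        if tag in self.banks:
            i = self.banks.index(tag)
            cost = self.latency(i)
            if i > 0:
                # Assumed migration policy: on a hit, swap the line with its
                # neighbor one bank closer to the processor.
                self.banks[i - 1], self.banks[i] = self.banks[i], self.banks[i - 1]
            return cost
        # Assumed placement policy: a missing line enters the farthest bank,
        # evicting whatever was there.
        self.banks[-1] = tag
        return None
```

With the default constants and four banks, a first access to a tag misses, and repeated hits to that tag then cost 9, 7, 5, and 3 cycles as the line moves one bank closer on each hit.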