|
ABSTRACT
Wire delays continue to grow as the dominant component oflatency for large caches.A recent work proposed an adaptive,non-uniform cache architecture (NUCA) to manage large, on-chipcaches.By exploiting the variation in access time acrosswidely-spaced subarrays, NUCA allows fast access to closesubarrays while retaining slow access to far subarrays.Whilethe idea of NUCA is attractive, NUCA does not employ designchoices commonly used in large caches, such as sequential tag-dataaccess for low power.Moreover, NUCA couples dataplacement with tag placement foregoing the flexibility of dataplacement and replacement that is possible in a non-uniformaccess cache.Consequently, NUCA can place only a few blockswithin a given cache set in the fastest subarrays, and mustemploy a high-bandwidth switched network to swap blockswithin the cache for high performance.In this paper, we proposethe Non-uniform access with Replacement And PlacementusIng Distance associativity" cache, or NuRAPID, whichleverages sequential tag-data access to decouple data placementfrom tag placement.Distance associativity, the placementof data at a certain distance (and latency), is separated from setassociativity, the placement of tags within a set.This decouplingenables NuRAPID to place flexibly the vast majority offrequently-accessed data in the fastest subarrays, with fewerswaps than NUCA.Distance associativity fundamentallychanges the trade-offs made by NUCA's best-performingdesign, resulting in higher performance and substantiallylower cache energy.A one-ported, non-banked NuRAPIDcache improves performance by 3% on average and up to 15%compared to a multi-banked NUCA with an infinite-bandwidthswitched network, while reducing L2 cache energy by 77%.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
| |
2
|
[2] D. Burger and T. M. Austin. The SimpleScalar tool set, version 2.0. Technical Report 1342, Computer Sciences Department, University of Wisconsin-Madison, June 1997.
|
| |
3
|
John H. Edmondson , Paul I. Rubinfeld , Peter J. Bannon , Bradley J. Benschneider , Debra Bernstein , Ruben W. Castelino , Elizabeth M. Cooper , Daniel E. Dever , Dale R. Donchin , Timothy C. Fischer , Anil K. Jain , Shekhar Mehta , Jeanne E. Meyer , Ronald P. Preston , Vidya Rajagopalan , Chandrasekhara Somanathan , Scott A. Taylor , Gilbert M. Wolrich, Internal organization of the Alpha 21164, a 300-MHz 64-bit quad-issue CMOS RISC microprocessor, Digital Technical Journal, v.7 n.1, p.119-135, Jan. 1995
|
 |
4
|
|
 |
5
|
|
 |
6
|
|
 |
7
|
|
| |
8
|
[8] S. D. Naffziger, G. Colon-Bonet, T. Fischer, R. Riedlinger, T. Sullivan, and T. Grutkowski. The implementation of the Itanium 2 microprocessor. IEEE Journal of Solid-State Circuits, 37(11):1448-1460, Nov. 2002.
|
| |
9
|
Michael D. Powell , Amit Agarwal , T. N. Vijaykumar , Babak Falsafi , Kaushik Roy, Reducing set-associative cache energy via way-prediction and selective direct-mapping, Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture, December 01-05, 2001, Austin, Texas
|
| |
10
|
[10] S. A. Przybylski. Performance-directed memory hierarchy design. Technical Report 366, Stanford University, Sept. 1988.
|
| |
11
|
[11] P. Shivakumar and N. P. Jouppi. Cacti 3.0: An integrated cache timing, power and area model. Technical report, Compaq Computer Corporation, Aug. 2001.
|
 |
12
|
|
| |
13
|
Gary Tyson , Matthew Farrens , John Matthews , Andrew R. Pleszkun, A modified approach to data cache management, Proceedings of the 28th annual international symposium on Microarchitecture, p.93-103, November 29-December 01, 1995, Ann Arbor, Michigan, United States
|
| |
14
|
[14] D. Weiss, J. J. Wuu, and V. Chin. The on-chip 3-mb subarray-based third-level cache on an Itanium microprocessor. IEEE Journal of Solid-State Circuits, 37(11):1523-1529, Nov. 2002.
|
CITED BY 21
|
|
|
|
|
Jaehyuk Huh , Changkyu Kim , Hazim Shafi , Lixin Zhang , Doug Burger , Stephen W. Keckler, A NUCA substrate for flexible CMP cache sharing, Proceedings of the 19th annual international conference on Supercomputing, June 20-22, 2005, Cambridge, Massachusetts
|
|
|
|
|
|
|
|
|
|
|
|
Serkan Ozdemir , Arindam Mallik , Ja Chun Ku , Gokhan Memik , Yehea Ismail, Variable latency caches for nanoscale processor, Proceedings of the 2007 ACM/IEEE conference on Supercomputing, November 10-16, 2007, Reno, Nevada
|
|
|
Alessandro Bardine , Pierfrancesco Foglia , Giacomo Gabrielli , Cosimo Antonio Prete, Analysis of static and dynamic energy consumption in NUCA caches: initial results, Proceedings of the 2007 workshop on MEmory performance: DEaling with Applications, systems and architecture, p.105-112, September 16-16, 2007, Brasov, Romania
|
|
|
|
|
|
Feihui Li , Chrysostomos Nicopoulos , Thomas Richardson , Yuan Xie , Vijaykrishnan Narayanan , Mahmut Kandemir, Design and Management of 3D Chip Multiprocessors Using Network-in-Memory, ACM SIGARCH Computer Architecture News, v.34 n.2, p.130-141, May 2006
|
|
|
|
|
|
Keshavan Varadarajan , S. K. Nandy , Vishal Sharda , Amrutur Bharadwaj , Ravi Iyer , Srihari Makineni , Donald Newell, Molecular Caches: A caching structure for dynamic creation of application-specific Heterogeneous cache regions, Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, p.433-442, December 09-13, 2006
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Li Zhao , Ravi Iyer , Jaideep Moses , Ramesh Illikkal , Srihari Makineni , Don Newell, Exploring Large-Scale CMP Architectures Using ManySim, IEEE Micro, v.27 n.4, p.21-33, July 2007
|
|
|
|
|
|
|
|
|
|
|