ABSTRACT
Increases in on-chip communication delay and the large working sets of server and scientific workloads complicate the design of the on-chip last-level cache for multicore processors. The large working sets favor a shared cache design that maximizes the aggregate cache capacity and minimizes off-chip memory requests. At the same time, the growing on-chip communication delay favors core-private caches that replicate data to minimize delays on global wires. Recent hybrid proposals offer lower average latency than conventional designs, but they address the placement requirements of only a subset of the data accessed by the application, require complex lookup and coherence mechanisms that increase latency, or fail to scale to high core counts. In this work, we observe that the cache access patterns of a range of server and scientific workloads can be classified into distinct classes, where each class is amenable to different block placement policies. Based on this observation, we propose Reactive NUCA (R-NUCA), a distributed cache design which reacts to the class of each cache access and places blocks at the appropriate location in the cache. R-NUCA cooperates with the operating system to support intelligent placement, migration, and replication without the overhead of an explicit coherence mechanism for the on-chip last-level cache. In a range of server, scientific, and multiprogrammed workloads, R-NUCA matches the performance of the best cache design for each workload, improving performance by 14% on average over competing designs and by 32% at best, while achieving performance within 5% of an ideal cache design.
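The class-based placement idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' mechanism: the abstract does not name the access classes or their policies, so the three classes used here (instructions, private data, shared data), the function `place_block`, the `cluster_size` parameter, and the fixed-boundary clustering are all hypothetical simplifications.

```python
from enum import Enum


class AccessClass(Enum):
    # Hypothetical classes; the abstract only says accesses fall into distinct classes.
    INSTRUCTION = "instruction"
    PRIVATE_DATA = "private_data"
    SHARED_DATA = "shared_data"


def place_block(access_class, block_addr, requesting_core, num_slices, cluster_size=4):
    """Return the index of the cache slice that would hold the block.

    Assumes one cache slice per core and num_slices divisible by cluster_size.
    """
    if access_class is AccessClass.PRIVATE_DATA:
        # Assumed policy: keep data used by a single core in that core's local slice.
        return requesting_core
    if access_class is AccessClass.SHARED_DATA:
        # Assumed policy: one fixed location per block, interleaved by address
        # across all slices, so the location does not depend on the requester.
        return block_addr % num_slices
    # Assumed policy for instructions: one copy per cluster of neighboring slices,
    # interleaved by address within the requesting core's cluster.
    cluster_base = (requesting_core // cluster_size) * cluster_size
    return cluster_base + block_addr % cluster_size


if __name__ == "__main__":
    for cls in AccessClass:
        print(cls.value, place_block(cls, 0x1234, requesting_core=5, num_slices=16))
```

In this sketch a shared block has a single home slice regardless of which core asks for it, while an instruction block may have one copy per cluster, which trades some capacity for shorter access distance.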
INDEX TERMS
Primary Classification: B. Hardware > B.3 MEMORY STRUCTURES > B.3.2 Design Styles
Subjects: Cache memories
Additional Classification: B. Hardware > B.3 MEMORY STRUCTURES > B.3.2 Design Styles
Subjects: Interleaved memories**; Shared memory
General Terms: Design, Experimentation, Performance
Keywords: block migration, block placement, block replication, cache, cache coherence, cache indexing, cache lookup, cache management, chip multiprocessor, cmp, coherence, data migration, data placement, data replication, interleaving, last-level cache, lookup, migration, multi-core, multicore, non-uniform cache access, nuca, placement, private cache, r-nuca, reactive nuca, replication, rotational interleaving, shared cache