|
ABSTRACT
This paper presents the Distributed Cooperative Caching, a scalable and energy-efficient scheme to manage chip multiprocessor (CMP) cache resources. The proposed configuration is based in the Cooperative Caching framework [3] but it is intended for large scale CMPs. Both centralized and distributed configurations have the advantage of combining the benefits of private and shared caches. In our proposal, the Coherence Engine has been redesigned to allow its partitioning and thus, eliminate the size constraints imposed by the duplication of all tags. At the same time, a global replacement mechanism has been added to improve the usage of cache space. Our framework uses several Distributed Coherence Engines spread across all the nodes to improve scalability. The distribution permits a better balance of the network traffic over the entire chip avoiding bottlenecks and increasing performance for a 32-core CMP by 21% over a traditional shared memory configuration and by 57% over the Cooperative Caching scheme. Furthermore, we have reduced the power consumption of the entire system by using a different tag allocation method and by reducing the number of tags compared on each request. For a 32-core CMP the Distributed Cooperative Caching framework provides an average improvement of the power/performance relation (MIPS3/W) of 3.66x over a traditional shared memory configuration and 4.30x over Cooperative Caching.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
|
 |
3
|
|
 |
4
|
|
| |
5
|
|
| |
6
|
|
| |
7
|
J. Dorsey, S. Searles, M. Ciraula, S. Johnson, N. Bujanos, D. Wu, M. Braganza, S. Meyers, E. Fang, and R. Kumar. An integrated quad-core opteron processor. In ISSCC '07: IEEE International Solid-State Circuits Conference, pages 102--103, February 2007.
|
| |
8
|
P. Dubey. A platform 2015 workload model: Recognition, mining and synthesis moves computers to the era of tera. Intel White Paper, Intel Corporation, 2005.
|
| |
9
|
|
 |
10
|
Jaehyuk Huh , Changkyu Kim , Hazim Shafi , Lixin Zhang , Doug Burger , Stephen W. Keckler, A NUCA substrate for flexible CMP cache sharing, Proceedings of the 19th annual international conference on Supercomputing, June 20-22, 2005, Cambridge, Massachusetts
[doi> 10.1145/1088149.1088154]
|
 |
11
|
|
 |
12
|
Daniel Lenoski , James Laudon , Kourosh Gharachorloo , Anoop Gupta , John Hennessy, The directory-based cache coherence protocol for the DASH multiprocessor, Proceedings of the 17th annual international symposium on Computer Architecture, p.148-159, May 28-31, 1990, Seattle, Washington, United States
|
| |
13
|
Peter S. Magnusson , Magnus Christensson , Jesper Eskilson , Daniel Forsgren , Gustav Hållberg , Johan Högberg , Fredrik Larsson , Andreas Moestedt , Bengt Werner, Simics: A Full System Simulation Platform, Computer, v.35 n.2, p.50-58, February 2002
[doi> 10.1109/2.982916]
|
 |
14
|
|
 |
15
|
Milo M. K. Martin , Daniel J. Sorin , Bradford M. Beckmann , Michael R. Marty , Min Xu , Alaa R. Alameldeen , Kevin E. Moore , Mark D. Hill , David A. Wood, Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset, ACM SIGARCH Computer Architecture News, v.33 n.4, November 2005
[doi> 10.1145/1105734.1105747]
|
| |
16
|
|
| |
17
|
R. Mullins. Minimising dynamic power consumption in on-chip networks. International Symposium on System-on-Chip, pages 1--4, November 2006.
|
| |
18
|
|
| |
19
|
N. Sakran, M. Yuffe, M. Mehalel, J. Doweck, E. Knoll, and A. Kovacs. The implementation of the 65nm dual-core 64b merom processor. In ISSCC '07: IEEE International Solid-State Circuits Conference, pages 106--590, February 2007.
|
| |
20
|
|
| |
21
|
D. Tarjan, S. Thoziyoor, and N. Jouppi. Cacti 4.0. Technical report, HP Labs Palo Alto, June 2006.
|
| |
22
|
S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, P. Iyer, A. Singh, T. Jacob, S. Jain, S. Venkataraman, Y. Hoskote, and N. Borkar. An 80-tile 1.28tflops network-on-chip in 65nm cmos. In ISSCC '07: IEEE International Solid-State Circuits Conference, February 2007.
|
| |
23
|
|
 |
24
|
|
|