A NUCA substrate for flexible CMP cache sharing
Full text: PDF (289 KB)
Source: International Conference on Supercomputing
Proceedings of the 19th Annual International Conference on Supercomputing
Cambridge, Massachusetts
SESSION: Session 1: cache
Pages: 31 - 40  
Year of Publication: 2005
ISBN:1-59593-167-8
Authors
Jaehyuk Huh  The University of Texas at Austin
Changkyu Kim  The University of Texas at Austin
Hazim Shafi  Austin Research Laboratory, IBM Research
Lixin Zhang  Austin Research Laboratory, IBM Research
Doug Burger  The University of Texas at Austin
Stephen W. Keckler  The University of Texas at Austin
Sponsor
SIGARCH: ACM Special Interest Group on Computer Architecture
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 32,   Downloads (12 Months): 148,   Citation Count: 27
DOI: http://doi.acm.org/10.1145/1088149.1088154

ABSTRACT

We propose an organization for the on-chip memory system of a chip multiprocessor in which 16 processors share a 16MB pool of 256 L2 cache banks. The L2 cache is organized as a non-uniform cache architecture (NUCA) array with a switched network embedded in it for high performance. We show that this organization can support the full spectrum of sharing degrees: unshared, in which each processor has a private portion of the cache, reducing hit latency; completely shared, in which every processor shares the entire cache, minimizing misses; and every point in between. We find the optimal degree of sharing for a number of cache bank mapping policies, and also evaluate a per-application cache partitioning strategy. We conclude that a static NUCA organization with a sharing degree of two or four works best across a suite of commercial and scientific parallel workloads. We also demonstrate that migratory, dynamic NUCA approaches improve performance significantly for a subset of the workloads at the cost of increased power consumption and complexity, especially as per-application cache partitioning strategies are applied.
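
The arithmetic behind a "sharing degree" can be made concrete with a small sketch. The totals (16 processors, 256 banks, 16MB) come from the abstract; everything else is an assumption made for illustration: that processors are grouped contiguously by ID, that banks and capacity are split evenly among groups, and the names `SharingConfig`, `sharing_config`, and `group_of`, which are hypothetical. The paper's actual bank mapping policies are not described on this page and may differ.

```python
from dataclasses import dataclass

# Totals taken from the abstract.
NUM_PROCESSORS = 16
NUM_BANKS = 256
TOTAL_CACHE_KB = 16 * 1024  # 16MB expressed in KB


@dataclass(frozen=True)
class SharingConfig:
    """Hypothetical summary of one sharing degree (illustration only)."""
    degree: int                 # processors sharing one cache partition
    groups: int                 # number of independent partitions
    banks_per_group: int        # L2 banks in each partition
    capacity_kb_per_group: int  # cache capacity of each partition


def sharing_config(degree: int) -> SharingConfig:
    """Split the bank pool evenly among groups of `degree` processors.

    Assumption: an even, static split. Degree 1 is the unshared case,
    degree 16 the completely shared case.
    """
    if degree <= 0 or NUM_PROCESSORS % degree != 0:
        raise ValueError("degree must evenly divide the processor count")
    groups = NUM_PROCESSORS // degree
    return SharingConfig(
        degree=degree,
        groups=groups,
        banks_per_group=NUM_BANKS // groups,
        capacity_kb_per_group=TOTAL_CACHE_KB // groups,
    )


def group_of(processor_id: int, degree: int) -> int:
    """Assumed contiguous grouping: processors 0..degree-1 form group 0."""
    if not 0 <= processor_id < NUM_PROCESSORS:
        raise ValueError("processor_id out of range")
    return processor_id // degree


if __name__ == "__main__":
    for d in (1, 2, 4, 8, 16):
        print(sharing_config(d))
```

Under these assumptions each bank holds 64KB (16MB / 256), a sharing degree of 1 gives every processor a private 1MB partition of 16 banks, and a sharing degree of 16 gives all processors the full 256 banks. Smaller degrees keep a processor's banks few and, presumably, physically close, which is the hit-latency side of the trade-off the abstract describes; larger degrees pool capacity, which is the miss-rate side.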


