ABSTRACT
Chip multiprocessors (CMPs) have the potential to exploit thread-level parallelism, particularly in embedded server farms where the number of available threads can be quite high. Unfortunately, both per-core and overall throughput are significantly affected by the organization of the lowest-level on-chip cache. On-chip caches for CMPs must handle the increased demand and contention of multiple cores. To complicate the problem, cache demand changes dynamically with phase changes, context switches, power-saving features, and assignments to asymmetric cores.

We propose PDAS, a distributed NUCA L2 cache design with an adaptive sharing mechanism. Each core independently measures its dynamic need, and all cache resources are managed to increase utilization, reduce migrations, and lower interference. Per-core performance degradation is bounded while overall throughput is optimized, thus qualitatively improving the performance of embedded systems where quality of service is an important characteristic.

In single-thread mode, PDAS delivers average improvements of 26%, 27%, and 13% over Private, Shared, and NUCA caches, respectively. This improvement is achieved while reducing internal migrations by 82% on average compared to NUCA. With thread contention, PDAS increases its performance and power advantage over prior work. The average migration reduction over NUCA rises to over 90%, and average IPC improvements over NUCA are 30%, 14%, and 35% for the 2T, 3T, and 4T scenarios.
INDEX TERMS
Primary Classification: C. Computer Systems Organization; C.3 Special-Purpose and Application-Based Systems
Subjects: Real-time and embedded systems
Additional Classification: C. Computer Systems Organization; C.1 Processor Architectures
General Terms: Design, Performance
Keywords: CMP, NUCA, PDAS, QOS, adaptive, bandwidth, cache, chip multiprocessor, cluster, data-stream, distributed, embedded, memory wall, migration, non-uniform access, partition, per thread degradation, phase