Improving support for locality and fine-grain sharing in chip multiprocessors
Source
Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT)
Toronto, Ontario, Canada
Session: Multicore memory hierarchy design (part 1)
Pages 155-165
Year of Publication: 2008
ISBN: 978-1-60558-282-5
Authors
Hemayet Hossain  University of Rochester, Rochester, NY, USA
Sandhya Dwarkadas  University of Rochester, Rochester, NY, USA
Michael C. Huang  University of Rochester, Rochester, NY, USA
Sponsors
ACM: Association for Computing Machinery
SIGARCH: ACM Special Interest Group on Computer Architecture
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1454115.1454138

ABSTRACT

Both commercial and scientific workloads benefit from concurrency and exhibit data sharing across threads/processes. The resulting sharing patterns are often fine-grain, with the modified cache lines still residing in the writer's primary cache when accessed. Chip multiprocessors present an opportunity to optimize for fine-grain sharing using direct access to remote processor components through low-latency on-chip interconnects. In this paper, we present Adaptive Replication, Migration, and producer-Consumer Optimization (ARMCO), a coherence protocol that, to the best of our knowledge, is the first to exploit direct access to the L1 caches of remote processors (rather than via coherence mechanisms) in order to support fine-grain sharing.

Our goal is to provide support for tightly coupled sharing by recognizing and adapting to common sharing patterns such as migratory, producer-consumer, multiple-reader, and multiple read-write. The protocol places data close to where it is most needed and leverages direct access when following conventional coherence actions proves wasteful. Via targeted optimizations for each of these access patterns, our proposed protocol is able to reduce the average access latency and increase the effective cache capacity at the L1 level with on-chip storage overhead as low as 0.38%. Full-system simulations of 16-processor CMPs show an average (geometric mean) speedup of 1.13 (ranging from 1.04 to 2.26) for 12 commercial, scientific, and mining workloads, with an average of 1.18 if we include 2 microbenchmarks. ARMCO also reduces on-chip bandwidth requirements by an average of 33.3% and dynamic energy (power) consumption by an average of 31.2% (20.2%). By evaluating optimizations at both the L1 and the L2 levels, we demonstrate that, in terms of performance, optimization at the L1 level supports fine-grain sharing more effectively than optimization at the L2 level.
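
The four sharing patterns named in the abstract can be illustrated with a small software sketch. The code below is not ARMCO's hardware detection mechanism, which the abstract does not describe. It is a simplified, textbook-style classifier over a per-cache-line access trace. The names `Access`, `is_migratory`, and `classify_sharing`, the `"private"` label, and the decision rules are all assumptions made for illustration.

```python
from collections import namedtuple
from itertools import groupby

# Hypothetical trace record: which core touched the line, and how ("R" or "W").
Access = namedtuple("Access", ["core", "op"])


def is_migratory(trace):
    """Assumed rule: the line moves from core to core, and every core
    writes it at least once during each of its turns."""
    # Collapse the trace into runs of consecutive accesses by the same core.
    runs = [[a.op for a in grp] for _, grp in groupby(trace, key=lambda a: a.core)]
    return len(runs) >= 2 and all("W" in run for run in runs)


def classify_sharing(trace):
    """Label the accesses to one cache line with a coarse sharing pattern."""
    readers = {a.core for a in trace if a.op == "R"}
    writers = {a.core for a in trace if a.op == "W"}
    if len(readers | writers) <= 1:
        return "private"              # not shared at all (label is an assumption)
    if not writers:
        return "multiple-reader"      # several cores read, nobody writes
    if len(writers) == 1:
        return "producer-consumer"    # one core writes, the others only read
    if is_migratory(trace):
        return "migratory"            # read-modify-write hand-offs between cores
    return "multiple read-write"      # several writers, no clean hand-off


migratory = [Access(0, "R"), Access(0, "W"), Access(1, "R"), Access(1, "W")]
prod_cons = [Access(0, "W"), Access(1, "R"), Access(0, "W"), Access(1, "R")]
read_only = [Access(0, "R"), Access(1, "R"), Access(2, "R")]
mixed = [Access(0, "W"), Access(1, "R"), Access(1, "W"), Access(0, "R")]

for name, trace in [("migratory", migratory), ("prod_cons", prod_cons),
                    ("read_only", read_only), ("mixed", mixed)]:
    print(name, "->", classify_sharing(trace))
```

A hardware protocol would have to make this decision incrementally from limited per-line state rather than from a full trace; the sketch is only meant to make the pattern definitions concrete.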

