ABSTRACT
Both commercial and scientific workloads benefit from concurrency and exhibit data sharing across threads/processes. The resulting sharing patterns are often fine-grain, with the modified cache lines still residing in the writer's primary cache when accessed. Chip multiprocessors present an opportunity to optimize for fine-grain sharing using direct access to remote processor components through low-latency on-chip interconnects. In this paper, we present Adaptive Replication, Migration, and producer-Consumer Optimization (ARMCO), a coherence protocol that, to the best of our knowledge, is the first to exploit direct access to the L1 caches of remote processors (rather than via coherence mechanisms) in order to support fine-grain sharing. Our goal is to provide support for tightly coupled sharing by recognizing and adapting to common sharing patterns such as migratory, producer-consumer, multiple-reader, and multiple read-write. The protocol places data close to where it is most needed and leverages direct access when following conventional coherence actions proves wasteful. Via targeted optimizations for each of these access patterns, our proposed protocol reduces the average access latency and increases the effective cache capacity at the L1 level, with an on-chip storage overhead as low as 0.38%. Full-system simulations of 16-processor CMPs show an average (geometric mean) speedup of 1.13 (ranging from 1.04 to 2.26) for 12 commercial, scientific, and mining workloads, with an average of 1.18 if we include 2 microbenchmarks. ARMCO also reduces on-chip bandwidth requirements by an average of 33.3%, dynamic energy consumption by 31.2%, and dynamic power consumption by 20.2%. By evaluating optimizations at both the L1 and the L2 level, we demonstrate that, when considering performance, optimization at the L1 level is more effective at supporting fine-grain sharing than optimization at the L2 level.
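To make the four sharing patterns named above concrete, the following is a minimal software sketch that labels a single cache line from a trace of its accesses. It is an illustration under stated assumptions, not the ARMCO hardware mechanism: the `classify` function, the `(core, op)` trace format, the extra `"private"` label, and the specific rules (for example, treating a line as migratory when every consecutive run of accesses by one core contains a write) are all hypothetical simplifications.

```python
from itertools import groupby
from typing import List, Tuple

# Hypothetical trace format: (core id, "R" for read or "W" for write).
Access = Tuple[int, str]


def classify(trace: List[Access]) -> str:
    """Label one cache line's access trace with a coarse sharing pattern.

    Assumed rules (illustrative only, not taken from the paper):
      - one core only                        -> "private"
      - several cores, no writes             -> "multiple-reader"
      - exactly one writer, other readers    -> "producer-consumer"
      - several writers, and every run of
        consecutive accesses by one core
        includes a write                     -> "migratory"
      - anything else                        -> "multiple read-write"
    """
    readers = {core for core, op in trace if op == "R"}
    writers = {core for core, op in trace if op == "W"}

    if len(readers | writers) <= 1:
        return "private"
    if not writers:
        return "multiple-reader"
    if len(writers) == 1:
        return "producer-consumer"

    # Split the trace into runs of consecutive accesses by the same core.
    runs = [[op for _, op in grp] for _, grp in groupby(trace, key=lambda a: a[0])]
    if all("W" in run for run in runs):
        return "migratory"
    return "multiple read-write"


if __name__ == "__main__":
    handoff = [(0, "R"), (0, "W"), (1, "R"), (1, "W"), (2, "R"), (2, "W")]
    print(classify(handoff))
```

In this sketch, a line that is read and then written by each core in turn is labeled migratory, while a line written by one core and only read by others is labeled producer-consumer. A real protocol would have to detect such patterns online from coherence events rather than from a complete trace.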