|
ABSTRACT
This paper explores a new technique called coherence decoupling, which breaks a traditional cache coherence protocol into two protocols: a Speculative Cache Lookup (SCL) protocol and a safe, backing coherence protocol. The SCL protocol produces a speculative load value, typically from an invalid cache line, permitting the processor to compute with incoherent data. In parallel, the coherence protocol obtains the necessary coherence permissions and the correct value. Eventually, the speculative use of the incoherent data can be verified against the coherent data. Thus, coherence decoupling can greatly reduce --- if not eliminate --- the effects of false sharing. Furthermore, coherence decoupling can also reduce latencies incurred by true sharing. SCL protocols reduce those latencies by speculatively writing updates into invalid lines, thereby increasing the accuracy of speculation, without complicating the simple, underlying coherence protocol that guarantees correctness.The performance benefits of coherence decoupling are evaluated using a full-system simulator and a mix of commercial and scientific benchmarks. Our results show that 40% to 90% of all coherence misses can be speculated correctly, and therefore their latencies partially or fully hidden. This capability results in performance improvements ranging from 3% to over 16%, in most cases where the latencies of coherence misses have an effect on performance.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
|
 |
3
|
|
 |
4
|
|
 |
5
|
F. Dahlgren , M. Dubois , P. Stenström, Combined performance gains of simple cache protocol extensions, Proceedings of the 21ST annual international symposium on Computer architecture, p.187-197, April 18-21, 1994, Chicago, Illinois, United States
|
 |
6
|
Michel Dubois , Jonas Skeppstedt , Livio Ricciulli , Krishnan Ramamurthy , Per Stenström, The detection and elimination of useless misses in multiprocessors, Proceedings of the 20th annual international symposium on Computer architecture, p.88-97, May 16-19, 1993, San Diego, California, United States
|
| |
7
|
Babak Falsafi , Alvin R. Lebeck , Steven K. Reinhardt , Ioannis Schoinas , Mark D. Hill , James R. Larus , Anne Rogers , David A. Wood, Application-specific protocols for user-level shared memory, Proceedings of the 1994 conference on Supercomputing, p.380-389, December 1994, Washington, D.C., United States
|
| |
8
|
|
 |
9
|
Kourosh Gharachorloo , Daniel Lenoski , James Laudon , Phillip Gibbons , Anoop Gupta , John Hennessy, Memory consistency and event ordering in scalable shared-memory multiprocessors, Proceedings of the 17th annual international symposium on Computer Architecture, p.15-26, May 28-31, 1990, Seattle, Washington, United States
|
 |
10
|
Phillip B. Gibbons , Michael Merritt , Kourosh Gharachorloo, Proving sequential consistency of high-performance shared memories (extended abstract), Proceedings of the third annual ACM symposium on Parallel algorithms and architectures, p.292-303, July 21-24, 1991, Hilton Head, South Carolina, United States
[doi> 10.1145/113379.113406]
|
| |
11
|
P. N. Glaskowsky. IBM Raises Curtain on Power5. Microprocessor Report, Oct. 14 2003.
|
 |
12
|
Chris Gniady , Babak Falsafi , T. N. Vijaykumar, Is SC + ILP = RC?, Proceedings of the 26th annual international symposium on Computer architecture, p.162-171, May 01-04, 1999, Atlanta, Georgia, United States
|
| |
13
|
|
 |
14
|
|
| |
15
|
IEEE. IEEE Standard for Scalable Coherent Interface (SCI), 1992. IEEE 1596--1992.
|
| |
16
|
T. Karkhanis and J. Smith. A day in the life of a cache miss. In Proceedings of 2nd Annual Workshop On Memory Performance Issues, May 2002.
|
| |
17
|
|
| |
18
|
S. Kaxiras and C. Young. Coherence communication prediction in shared-memory multiprocessors. In Proceedings of the 6th Int. Symp. on High Performance Computer Architecture, pages 156--167, Feb. 2000.
|
 |
19
|
J. Kuskin , D. Ofelt , M. Heinrich , J. Heinlein , R. Simoni , K. Gharachorloo , J. Chapin , D. Nakahira , J. Baxter , M. Horowitz , A. Gupta , M. Rosenblum , J. Hennessy, The Stanford FLASH multiprocessor, Proceedings of the 21ST annual international symposium on Computer architecture, p.302-313, April 18-21, 1994, Chicago, Illinois, United States
|
 |
20
|
|
 |
21
|
An-Chow Lai , Babak Falsafi, Selective, accurate, and timely self-invalidation using last-touch prediction, Proceedings of the 27th annual international symposium on Computer architecture, p.139-148, June 2000, Vancouver, British Columbia, Canada
|
 |
22
|
|
| |
23
|
Daniel Lenoski , James Laudon , Kourosh Gharachorloo , Wolf-Dietrich Weber , Anoop Gupta , John Hennessy , Mark Horowitz , Monica S. Lam, The Stanford Dash Multiprocessor, Computer, v.25 n.3, p.63-79, March 1992
[doi> 10.1109/2.121510]
|
 |
24
|
|
 |
25
|
|
 |
26
|
Milo M. K. Martin , Pacia J. Harper , Daniel J. Sorin , Mark D. Hill , David A. Wood, Using destination-set prediction to improve the latency/bandwidth tradeoff in shared-memory multiprocessors, Proceedings of the 30th annual international symposium on Computer architecture, June 09-11, 2003, San Diego, California
|
 |
27
|
|
| |
28
|
Milo M. K. Martin , Daniel J. Sorin , Harold W. Cain , Mark D. Hill , Mikko H. Lipasti, Correctly implementing value prediction in microprocessors that support multithreading or multiprocessing, Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture, December 01-05, 2001, Austin, Texas
|
 |
29
|
|
 |
30
|
Andreas Moshovos , Scott E. Breach , T. N. Vijaykumar , Gurindar S. Sohi, Dynamic speculation and synchronization of data dependences, Proceedings of the 24th annual international symposium on Computer architecture, p.181-193, June 01-04, 1997, Denver, Colorado, United States
|
| |
31
|
|
 |
32
|
|
 |
33
|
Vijay S. Pai , Parthasarathy Ranganathan , Sarita V. Adve , Tracy Harton, An evaluation of memory consistency models for shared-memory systems with ILP processors, Proceedings of the seventh international conference on Architectural support for programming languages and operating systems, p.12-23, October 01-04, 1996, Cambridge, Massachusetts, United States
|
 |
34
|
Y. N. Patt , W. M. Hwu , M. Shebanow, HPS, a new microarchitecture: rationale and introduction, Proceedings of the 18th annual workshop on Microprogramming, p.103-108, December 03-06, 1985, Pacific Grove, California, United States
|
| |
35
|
|
 |
36
|
|
| |
37
|
|
 |
38
|
S. K. Reinhardt , J. R. Larus , D. A. Wood, Tempest and typhoon: user-level shared memory, Proceedings of the 21ST annual international symposium on Computer architecture, p.325-336, April 18-21, 1994, Chicago, Illinois, United States
|
 |
39
|
Karthikeyan Sankaralingam , Ramadass Nagarajan , Haiming Liu , Changkyu Kim , Jaehyuk Huh , Doug Burger , Stephen W. Keckler , Charles R. Moore, Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture, Proceedings of the 30th annual international symposium on Computer architecture, June 09-11, 2003, San Diego, California
|
| |
40
|
|
 |
41
|
|
 |
42
|
Per Stenström , Mats Brorsson , Lars Sandberg, An adaptive cache coherence protocol optimized for migratory sharing, Proceedings of the 20th annual international symposium on Computer architecture, p.109-118, May 16-19, 1993, San Diego, California, United States
|
| |
43
|
|
| |
44
|
|
CITED BY 10
|
|
|
|
|
|
|
|
Jaehyuk Huh , Changkyu Kim , Hazim Shafi , Lixin Zhang , Doug Burger , Stephen W. Keckler, A NUCA substrate for flexible CMP cache sharing, Proceedings of the 19th annual international conference on Supercomputing, June 20-22, 2005, Cambridge, Massachusetts
|
|
|
|
|
|
Smruti R. Sarangi , Wei Liu, Josep Torrellas , Yuanyuan Zhou, ReSlice: Selective Re-Execution of Long-Retired Misspeculated Instructions Using Forward Slicing, Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture, p.257-270, November 12-16, 2005, Barcelona, Spain
|
|
|
|
|
|
|
|
|
Thomas F. Wenisch , Stephen Somogyi , Nikolaos Hardavellas , Jangwoo Kim , Anastassia Ailamaki , Babak Falsafi, Temporal Streaming of Shared Memory, ACM SIGARCH Computer Architecture News, v.33 n.2, p.222-233, May 2005
|
|
|
|
|
|
|
|