|
ABSTRACT
This paper introduces dynamic self-invalidation (DSI), a new technique for reducing cache coherence overhead in shared-memory multiprocessors. DSI eliminates invalidation messages by having a processor automatically invalidate its local copy of a cache block before a conflicting access by another processor. Eliminating invalidation overhead is particularly important under sequential consistency, where the latency of invalidating outstanding copies can increase a program's critical path.DSI is applicable to software, hardware, and hybrid coherence schemes. In this paper we evaluate DSI in the context of hardware directory-based write-invalidate coherence protocols. Our results show that DSI reduces execution time of a sequentially consistent full-map coherence protocol by as much as 41%. This is comparable to an implementation of weak consistency that uses a coalescing write-buffer to allow up to 16 outstanding requests for exclusive blocks. When used in conjunction with weak consistency, DSI can exploit tear-off blocks---which eliminate both invalidation and acknowledgment messages---for a total reduction in messages of up to 26%.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
Sarita V. Adve and Mark D. Hill. Implementing Sequential Consistency in Cache-Based Systems. In ICPP90, pages 147-I50, August 1990.
|
 |
2
|
|
 |
3
|
A. Agarwal , R. Simoni , J. Hennessy , M. Horowitz, An evaluation of directory schemes for cache coherence, Proceedings of the 15th Annual International Symposium on Computer architecture, p.280-298, May 30-June 02, 1988, Honolulu, Hawaii, United States
|
| |
4
|
Thomas E. Anderson. The Performance Implications of Spin-Waiting Alternatives for Shared-Memory Multiprocessors. In Proceedings of the 1989 International Conference on Parallel Processing (Vol. H Software), pages 11170--11174, August 1989.
|
| |
5
|
|
| |
6
|
Brian Case. SPARC V9 Adds Wealth of New Features. Microprocessor Report, 7(9), February 1993.
|
| |
7
|
L.M. Censier and P. Feautrier. A New Solution to Coherence Problems in Multicache Systems. IEEE Transacttons on Computers, C-27(t2):1112-1118, December 1978.
|
 |
8
|
David Chaiken , John Kubiatowicz , Anant Agarwal, LimitLESS directories: A scalable cache coherence scheme, Proceedings of the fourth international conference on Architectural support for programming languages and operating systems, p.224-234, April 08-11, 1991, Santa Clara, California, United States
|
| |
9
|
|
| |
10
|
Trishul M. Chilimbi and James R. Larus. Cachier: A Tool for Automatically Inserting CICO Annotations. In Proceedings of the 1994 h~ternational Conference on Parallel Processhzg (Vol. II Software), pages ii-89-98, August 1994.
|
| |
11
|
|
 |
12
|
|
 |
13
|
A. Krishnamurthy , D. E. Culler , A. Dusseau , S. C. Goldstein , S. Lumetta , T. von Eicken , K. Yelick, Parallel programming in Split-C, Proceedings of the 1993 ACM/IEEE conference on Supercomputing, p.262-273, December 1993, Portland, Oregon, United States
[doi> 10.1145/169627.169724]
|
| |
14
|
Ron Cytron, Steve Karlovsky, and Kevin P. McAuliffe. Automatic Management of Programmable Caches. In Proceedings of the 1988 bzternational Conference on Parallel Processing (Vol. H Software), pages 229-238, Aug 1988.
|
 |
15
|
F. Dahlgren , M. Dubois , P. Stenström, Combined performance gains of simple cache protocol extensions, Proceedings of the 21ST annual international symposium on Computer architecture, p.187-197, April 18-21, 1994, Chicago, Illinois, United States
|
 |
16
|
|
 |
17
|
|
| |
18
|
Vincent W. Freeh, David K. Lowenthal, and Gregory R. Andrews. Distributed Filaments: Efficient Fine-Grain Parallelism on a Cluster of Workstations. In Proceedings of the First USENIX Symposium on Operating Systems Design and Implementation ( OSDI), pages 201-213, November 1994.
|
 |
19
|
Kourosh Gharachorloo , Daniel Lenoski , James Laudon , Phillip Gibbons , Anoop Gupta , John Hennessy, Memory consistency and event ordering in scalable shared-memory multiprocessors, Proceedings of the 17th annual international symposium on Computer Architecture, p.15-26, May 28-31, 1990, Seattle, Washington, United States
|
 |
20
|
Anoop Gupta , John Hennessy , Kourosh Gharachorloo , Todd Mowry , Wolf-Dietrich Weber, Comparative evaluation of latency reducing and tolerating techniques, Proceedings of the 18th annual international symposium on Computer architecture, p.254-263, May 27-30, 1991, Toronto, Ontario, Canada
|
| |
21
|
|
 |
22
|
Mark D. Hill , James R. Larus , Steven K. Reinhardt , David A. Wood, Cooperative shared memory: software and hardware for scalable multiprocessor, Proceedings of the fifth international conference on Architectural support for programming languages and operating systems, p.262-273, October 12-15, 1992, Boston, Massachusetts, United States
|
 |
23
|
|
| |
24
|
Pete Keleher, Sandhya Dwarkadas, Alan Cox, and Willy Zwanenepoel. TreadMarks: Distributed Shared Memory on Standard Workstations and Operations Systems. Technical Report COMP TR93-214, Department of Computer Science, Rice University, November 1993.
|
| |
25
|
Gordon Kurpanek, Ken Chan, Jason Zheng, Eric Delano, and William Bryg. PA7200: A PA-RISC Processor with Integrated High Performance MP Bus Interface. In Compcon, pages 375-382, 1994.
|
 |
26
|
J. Kuskin , D. Ofelt , M. Heinrich , J. Heinlein , R. Simoni , K. Gharachorloo , J. Chapin , D. Nakahira , J. Baxter , M. Horowitz , A. Gupta , M. Rosenblum , J. Hennessy, The Stanford FLASH multiprocessor, Proceedings of the 21ST annual international symposium on Computer architecture, p.302-313, April 18-21, 1994, Chicago, Illinois, United States
|
| |
27
|
Leslie Lamport. How to Make a Multiprocessor Compute that Correctly Executes Multiprocess Programs. IEEE Transactions on Computers, C-28(9):690-691, September 1979.
|
| |
28
|
|
| |
29
|
Daniel Lenoski , James Laudon , Kourosh Gharachorloo , Wolf-Dietrich Weber , Anoop Gupta , John Hennessy , Mark Horowitz , Monica S. Lam, The Stanford Dash Multiprocessor, Computer, v.25 n.3, p.63-79, March 1992
[doi> 10.1109/2.121510]
|
| |
30
|
|
| |
31
|
|
 |
32
|
Steven K. Reinhardt , Mark D. Hill , James R. Larus , Alvin R. Lebeck , James C. Lewis , David A. Wood, The Wisconsin Wind Tunnel: virtual prototyping of parallel computers, Proceedings of the 1993 ACM SIGMETRICS conference on Measurement and modeling of computer systems, p.48-60, May 10-14, 1993, Santa Clara, California, United States
|
 |
33
|
S. K. Reinhardt , J. R. Larus , D. A. Wood, Tempest and typhoon: user-level shared memory, Proceedings of the 21ST annual international symposium on Computer architecture, p.325-336, April 18-21, 1994, Chicago, Illinois, United States
|
| |
34
|
|
| |
35
|
Ioannis Schoinas, Babak Falsafi, Alvin R. Lebeck, Steve K. Reinhardt, James R. Larus, and David A. Wood. Fine-grain Access Control for Distributed Shared Memory. Submitted for publication, March 1994.
|
 |
36
|
|
| |
37
|
B. Smith. Architecture and Applications of the HEP Multiprocessor Computer System. In Proceedings of the hTt. Soc. for Opt. Engr, pages 241-248, 1982.
|
 |
38
|
Per Stenström , Mats Brorsson , Lars Sandberg, An adaptive cache coherence protocol optimized for migratory sharing, Proceedings of the 20th annual international symposium on Computer architecture, p.109-118, May 16-19, 1993, San Diego, California, United States
|
| |
39
|
C.K. Tang. Cache System Design in the Tightly Coupled Multiprocessor System. In Proc. AFIPS, pages 749-753, 1976.
|
 |
40
|
David A. Wood , Satish Chandra , Babak Falsafi , Mark D. Hill , James R. Larus , Alvin R. Lebeck , James C. Lewis , Shubhendu S. Mukherjee , Subbarao Palacharla , Steven K. Reinhardt, Mechanisms for cooperative shared memory, Proceedings of the 20th annual international symposium on Computer architecture, p.156-167, May 16-19, 1993, San Diego, California, United States
|
CITED BY 26
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Richard P. Martin , Amin M. Vahdat , David E. Culler , Thomas E. Anderson, Effects of communication latency, overhead, and bandwidth in a cluster architecture, ACM SIGARCH Computer Architecture News, v.25 n.2, p.85-97, May 1997
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Stephen Somogyi , Thomas F. Wenisch , Nikolaos Hardavellas , Jangwoo Kim , Anastassia Ailamaki , Babak Falsafi, Memory coherence activity prediction in commercial workloads, Proceedings of the 3rd workshop on Memory performance issues: in conjunction with the 31st international symposium on computer architecture, p.37-45, June 20-20, 2004, Munich, Germany
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Michelle J. Moravan , Jayaram Bobba , Kevin E. Moore , Luke Yen , Mark D. Hill , Ben Liblit , Michael M. Swift , David A. Wood, Supporting nested transactional memory in logTM, ACM SIGPLAN Notices, v.41 n.11, November 2006
|
|
|
|
|
|
|
|
|
|
|