ABSTRACT
Hardware distributed shared memory (DSM) systems efficiently support fine-grain sharing of data by maintaining coherence at the level of individual cache lines and providing automatic replication in processor caches. Software DSM systems, on the other hand, amortize high communication costs by maintaining coherence at coarser granularities and replicating data at the level of local main memories. Even though software DSM systems have traditionally been targeted at loosely coupled environments, some of their techniques are potentially useful in tightly coupled multiprocessors. In particular, communicating data at a coarse grain can sometimes be more efficient than transferring the same data as individual cache lines. Furthermore, replication in local memories can accommodate applications with larger working sets than replication in processor caches alone. Therefore, combining the two techniques in a hybrid protocol can potentially exploit the benefits of each approach.
This paper proposes one such hybrid protocol and evaluates its performance in the context of the FLASH multiprocessor architecture. The hybrid system allows the programmer to optionally identify regions of data that are shared at a coarse granularity. Coherence for such data is maintained at the grain of the entire region using a software-DSM-style protocol. We evaluate the performance gains of this approach through a detailed simulation study of several parallel applications. Our preliminary results show that the hybrid protocol can eliminate a substantial fraction of remote cache misses through bulk transfer of coarse-grain data regions and replication of such data in local memories. The performance gains over hardware cache coherence are modest at low network latencies, but increase substantially at higher network latencies and processor speeds. Finally, we show that, like cache-only memory architectures, the hybrid protocol is insensitive to data placement.