ABSTRACT
Parallel workstations, each comprising tens of processors based on shared memory, promise cost-effective scalable multiprocessing. This article explores the coupling of such small- to medium-scale shared-memory multiprocessors through software over a local area network to synthesize larger shared-memory systems. We call these systems Distributed Shared-memory MultiProcessors (DSMPs). This article introduces the design of a shared-memory system that uses multiple granularities of sharing, called MGS, and presents a prototype implementation of MGS on the MIT Alewife multiprocessor. Multigrain shared memory enables the collaboration of hardware and software shared memory, thus synthesizing a single transparent shared-memory address space across a cluster of multiprocessors. The system leverages the efficient support for fine-grain cache-line sharing within multiprocessor nodes as often as possible, and resorts to coarse-grain page-level sharing across nodes only when absolutely necessary. Using our prototype implementation of MGS, an in-depth study of several shared-memory applications is conducted to understand the behavior of DSMPs. Our study is the first to comprehensively explore the DSMP design space, and to compare the performance of DSMPs against all-software and all-hardware DSMs on a single experimental platform. Keeping the total number of processors fixed, we show that applications execute up to 85% faster on a DSMP than on an all-software DSM. We also show that all-hardware DSMs hold a significant performance advantage over DSMPs on challenging applications, between 159% and 1014%. However, program transformations to improve data locality for these applications allow DSMPs to almost match the performance of an all-hardware multiprocessor of the same size.
REVIEW
Reviewer: Farnaz Mounes-Toussi
The paper describes a distributed shared-memory
multiprocessor system in which each node is a multiprocessor
with hardware support for cache coherence. Nodes are connected
through a Local Area Network and cache coherence is supported
by soft