ABSTRACT
Clusters of symmetric multiprocessors (SMPs) are important platforms for high-performance computing. With the success of hardware cache-coherent distributed shared memory (DSM), considerable effort has also gone into supporting the coherent shared-address-space programming model in software on clusters. Much research has been done on fast communication on clusters and on protocols for supporting software shared memory across them. However, the performance of software virtual memory (SVM) is still far from that achieved on hardware DSM systems. The goal of this paper is to improve the performance of SVM on system area network clusters by considering communication and protocol layer interactions. We first examine which communication system bottlenecks stand in the way of improving the parallel performance of SVM clusters: in particular, which parameters of the communication architecture are most important to improve further relative to processor speed, which are already adequate on modern systems for most applications, and how this will change with technology in the future. We find that the most important communication subsystem cost to improve is the overhead of generating and delivering interrupts for asynchronous protocol processing. We then show that, by providing simple and general support for asynchronous message handling in a commodity network interface (NI) and by altering SVM protocols appropriately, protocol activity can be decoupled from asynchronous message handling, and the need for interrupts or polling can be eliminated. The NI mechanisms needed are generic, not SVM-dependent. We prototype the mechanisms and such a synchronous home-based LRC protocol, called GeNIMA (GEneral-purpose Network Interface support for shared Memory Abstractions), on a cluster of SMPs with a programmable NI.
We find that the performance improvements are substantial, bringing performance on a small-scale SMP cluster much closer to that of hardware-coherent shared memory for many applications, and we show the value of each of the mechanisms in different applications.
REVIEW
"Veronica Lagrange : Reviewer"
This paper describes research to improve performance
of software virtual memory (SVM) for clustered environments by targeting
potential bottlenecks on the communication and protocol layer
interactions. Host overhead, I/O bus bandw