ABSTRACT
The performance of page-based software shared virtual memory (SVM) is still far from that achieved on hardware-coherent distributed shared memory (DSM) systems. The interrupt cost for asynchronous protocol processing has been found to be a key source of performance loss and complexity. This paper shows that by providing simple and general support for asynchronous message handling in a commodity network interface (NI), and by altering SVM protocols appropriately, protocol activity can be decoupled from asynchronous message handling and the need for interrupts or polling can be eliminated. The NI mechanisms needed are generic, not SVM-dependent. They also require neither visibility into the node memory system nor code instrumentation to identify memory operations. We prototype the mechanisms, together with a synchronous home-based lazy release consistency (LRC) protocol that uses them, called GeNIMA (GEneral-purpose Network Interface support in a shared Memory Abstraction), on a cluster of SMPs with a programmable NI, though the mechanisms are simple and do not require programmability. We find that the performance improvements are substantial, bringing performance on a small-scale SMP cluster much closer to that of hardware-coherent shared memory for many applications, and we show the value of each of the mechanisms in different applications. Application performance improves by about 37% on average for reasonably well performing applications, even on our relatively slow programmable NI, and by more for others. We discuss the key remaining bottlenecks at the protocol level and use a firmware performance monitor in the NI to understand the interactions with, and the implications for, the communication layer.