ABSTRACT
Many of the programming challenges encountered in small- to moderate-scale hardware cache-coherent shared memory machines have been extensively studied. While work remains to be done, the basic techniques needed to program such machines efficiently have been well explored. Recently, a number of researchers have presented architectural techniques for scaling a cache-coherent shared address space to much larger processor counts. In this paper, we examine the extent to which applications can achieve reasonable performance on such large-scale, cache-coherent, distributed shared address space machines, by determining the problem sizes needed to achieve a reasonable level of efficiency. We also look at how much programming effort and optimization is needed to achieve high efficiency, beyond that needed at small processor counts. For each application, we discuss the main architectural bottlenecks that prevent smaller problem sizes or less optimized programs from achieving good efficiency. Our results show that some applications either do not scale or must be heavily optimized to do so. For most of the applications we studied, however, it is not necessary to heavily modify the code or restructure algorithms to scale well up to several hundred processors, once the basic load balancing and data locality techniques that are also needed on small-scale systems have been applied. Programs written with some care perform well without substantially compromising the ease-of-programming advantage of a shared address space, and the problem sizes required to achieve good performance are surprisingly small. It is important to be careful about how data structures and layouts interact with system granularities, but these optimizations are usually needed for moderate-scale machines as well.