ABSTRACT
The next generations of supercomputers are projected to have hundreds of thousands of processors. However, as the number of processors grows, the scalability of applications will be the dominant challenge. This forces us to reexamine some of the fundamental ways in which we approach the design and use of parallel languages and runtime systems. In this paper we show how the globally shared arrays in a popular Partitioned Global Address Space (PGAS) language, Unified Parallel C (UPC), can be combined with a new collective interface to improve both performance and scalability. This interface allows subsets, or teams, of threads to perform a collective together. As opposed to MPI's communicators, our interface allows sets of threads to be placed in teams instantly rather than by explicitly constructing communicators, thus allowing for more dynamic team construction and manipulation. We motivate our ideas with three application kernels: dense matrix multiplication, dense Cholesky factorization, and multidimensional Fourier transforms. We describe how these three kernels can be succinctly written in UPC, thereby aiding productivity. We also show how such an interface allows for scalability by running on up to 16,384 processors of the Blue Gene/L. In a few lines of UPC code, we wrote a dense matrix multiply routine that achieves 28.8 TFlop/s and a 3D FFT that achieves 2.1 TFlop/s. We analyze our performance results through models and show that the machine resources, rather than the interfaces themselves, limit the performance.