Performance without pain = productivity: data layout and collective communication in UPC
Source
Principles and Practice of Parallel Programming
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Salt Lake City, UT, USA
SESSION: Programming model extensions
Pages 99-110
Year of Publication: 2008
ISBN: 978-1-59593-795-7
Authors
Rajesh Nishtala, UC Berkeley, Berkeley, CA, USA
George Almasi, IBM T.J. Watson Research Center, Yorktown Heights, NY, USA
Calin Cascaval, IBM T.J. Watson Research Center, Yorktown Heights, NY, USA
Sponsors
SIGPLAN: ACM Special Interest Group on Programming Languages
ACM: Association for Computing Machinery
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1345206.1345224

ABSTRACT

The next generations of supercomputers are projected to have hundreds of thousands of processors. However, as the number of processors grows, the scalability of applications will be the dominant challenge. This forces us to reexamine some of the fundamental ways in which we approach the design and use of parallel languages and runtime systems.

In this paper we show how the globally shared arrays in a popular Partitioned Global Address Space (PGAS) language, Unified Parallel C (UPC), can be combined with a new collective interface to improve both performance and scalability. This interface allows subsets, or teams, of threads to perform a collective together. Unlike MPI's communicators, which must be constructed explicitly, our interface allows sets of threads to be placed in teams instantly, thus allowing for more dynamic team construction and manipulation.

We motivate our ideas with three application kernels: dense matrix multiplication, dense Cholesky factorization, and multidimensional Fourier transforms. We describe how these three applications can be written succinctly in UPC, thereby aiding productivity. We also show how such an interface allows for scalability by running on up to 16,384 processors on the Blue Gene/L. In a few lines of UPC code, we wrote a dense matrix multiply routine that achieves 28.8 TFlop/s and a 3D FFT that achieves 2.1 TFlop/s. We analyze our performance results through models and show that the machine resources, rather than the interfaces themselves, limit the performance.


