ABSTRACT
Dense LU factorization has a high ratio of computation to communication and, as evidenced by the High Performance Linpack (HPL) benchmark, this property makes it scale well on most parallel machines. Nevertheless, the standard algorithm has non-trivial dependence patterns that limit parallelism, and local computations require large matrices to achieve good single-processor performance. We present an alternative programming model for this type of problem that combines UPC's global address space with lightweight multithreading. We introduce the concept of memory-constrained lookahead, in which the amount of concurrency managed by each processor is controlled by the amount of memory available. We implement novel techniques for steering the computation to optimize for high performance, and we demonstrate the scalability and portability of UPC with teraflop-level performance on some machines, comparing favourably to state-of-the-art MPI codes.
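The idea that available memory bounds concurrency can be illustrated with a toy simulation. The sketch below is not the scheduler described in this work: the function names, the first-in-first-out retirement order, and the assumption that every panel buffer has the same size are all hypothetical, chosen only to show how a memory budget might cap the number of panels in flight.

```python
from collections import deque


def panel_bytes(rows, block, elem_bytes=8):
    """Size of one panel buffer (rows x block doubles). Hypothetical helper."""
    return rows * block * elem_bytes


def run_with_lookahead(num_panels, rows, block, mem_budget):
    """Toy model: admit panels ahead of the current one while buffers fit.

    Returns (peak panels in flight, order in which panels were retired).
    Assumes equal-sized buffers and FIFO retirement (illustrative only).
    """
    cost = panel_bytes(rows, block)
    if cost > mem_budget:
        raise ValueError("memory budget too small for a single panel")

    in_flight = deque()   # panels whose buffers are currently allocated
    used = 0              # bytes currently held by in-flight panels
    next_panel = 0        # next panel index to admit
    peak = 0
    retired = []

    while len(retired) < num_panels:
        # Look ahead: admit as many further panels as the budget allows.
        while next_panel < num_panels and used + cost <= mem_budget:
            in_flight.append(next_panel)
            used += cost
            next_panel += 1
        peak = max(peak, len(in_flight))

        # Retire the oldest panel and release its buffer.
        retired.append(in_flight.popleft())
        used -= cost

    return peak, retired
```

With a budget equal to three panel buffers, for example `run_with_lookahead(10, 1000, 64, 3 * panel_bytes(1000, 64))`, at most three panels are in flight at once; halving the budget shrinks the lookahead window accordingly.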