Multi-threading and one-sided communication in parallel LU factorization
Source
Conference on High Performance Networking and Computing
Proceedings of the 2007 ACM/IEEE conference on Supercomputing - Volume 00
Reno, Nevada
SESSION: Performance tools and methods
Article No. 31
Year of Publication: 2007
ISBN: 978-1-59593-764-3
Authors
Parry Husbands  Lawrence Berkeley National Laboratory, Berkeley, CA
Katherine Yelick  University of California at Berkeley, Berkeley, CA
Sponsors
IEEE-CS\DATC : IEEE Computer Society
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1362622.1362664

ABSTRACT

Dense LU factorization has a high ratio of computation to communication and, as evidenced by the High Performance Linpack (HPL) benchmark, this property makes it scale well on most parallel machines. Nevertheless, the standard algorithm for this problem has non-trivial dependence patterns that limit parallelism, and local computations require large matrices in order to achieve good single-processor performance. We present an alternative programming model for this type of problem, which combines UPC's global address space with lightweight multithreading. We introduce the concept of memory-constrained lookahead, where the amount of concurrency managed by each processor is controlled by the amount of memory available. We implement novel techniques for steering the computation to optimize for high performance and demonstrate the scalability and portability of UPC with teraflop-level performance on some machines, comparing favourably to state-of-the-art MPI codes.
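The dependence pattern mentioned in the abstract can be illustrated with a minimal, serial sketch of right-looking blocked LU factorization. This is not the authors' UPC code: it is plain Python, it omits the partial pivoting that HPL performs, and the names `blocked_lu`, `reconstruct`, and the block size `b` are assumptions introduced only for illustration. Each outer step has three phases (panel factorization, triangular solve, trailing update), and the next panel cannot be factored until the trailing update has touched its columns, which is the dependence that lookahead schemes try to overlap.

```python
def blocked_lu(a, b):
    """Right-looking blocked LU, in place, WITHOUT pivoting (illustrative only).

    a: square matrix as a list of row lists; b: block (panel) width.
    On return, the strict lower triangle holds L (unit diagonal implied)
    and the upper triangle holds U.
    """
    n = len(a)
    for k in range(0, n, b):
        e = min(k + b, n)  # panel covers columns k..e-1

        # Phase 1: panel factorization (unblocked LU on the panel columns).
        for j in range(k, e):
            for i in range(j + 1, n):
                a[i][j] /= a[j][j]
                for c in range(j + 1, e):
                    a[i][c] -= a[i][j] * a[j][c]

        # Phase 2: triangular solve, U12 = L11^{-1} * A12.
        for j in range(k, e):
            for i in range(j + 1, e):
                for c in range(e, n):
                    a[i][c] -= a[i][j] * a[j][c]

        # Phase 3: trailing update, A22 -= L21 * U12.
        # The next panel (columns e..e+b-1) depends on this phase.
        for i in range(e, n):
            for j in range(k, e):
                lij = a[i][j]
                for c in range(e, n):
                    a[i][c] -= lij * a[j][c]
    return a


def reconstruct(lu):
    """Multiply the packed L (unit diagonal) and U factors back together."""
    n = len(lu)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(min(i, j) + 1):
                l_ik = 1.0 if k == i else lu[i][k]
                s += l_ik * lu[k][j]
            out[i][j] = s
    return out


if __name__ == "__main__":
    # Diagonally dominant test matrix, so skipping pivoting is safe here.
    original = [
        [4.0, 1.0, 2.0, 0.5],
        [1.0, 5.0, 1.0, 2.0],
        [2.0, 1.0, 6.0, 1.0],
        [0.5, 2.0, 1.0, 7.0],
    ]
    lu = blocked_lu([row[:] for row in original], 2)
    product = reconstruct(lu)
    err = max(
        abs(product[i][j] - original[i][j]) for i in range(4) for j in range(4)
    )
    print("max reconstruction error:", err)
```

In a parallel setting, the columns of the next panel are the first part of the trailing update that must finish, so a lookahead scheme can start that panel early while the rest of Phase 3 is still running. Each such in-flight panel needs buffer space, which is one plausible reading of why the abstract ties the depth of lookahead to the memory available.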



