Petascale computing with accelerators
Source
Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Raleigh, NC, USA
SESSION: High end computing software
Pages 241-250
Year of Publication: 2009
ISBN: 978-1-60558-397-6
Authors
Michael Kistler, IBM Corporation, Austin, TX, USA
John Gunnels, IBM Corporation, Yorktown, NY, USA
Daniel Brokenshire, IBM Corporation, Austin, TX, USA
Brad Benton, IBM Corporation, Austin, TX, USA
Sponsors
ACM: Association for Computing Machinery
SIGPLAN: ACM Special Interest Group on Programming Languages
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1504176.1504212

ABSTRACT

A trend is developing in high performance computing in which commodity processors are coupled to various types of computational accelerators. Such systems are commonly called hybrid systems. In this paper, we describe our experience developing an implementation of the Linpack benchmark for a petascale hybrid system, the LANL Roadrunner cluster built by IBM for Los Alamos National Laboratory. This system combines traditional x86-64 host processors with IBM PowerXCell™ 8i accelerator processors. The implementation of Linpack we developed was the first to achieve a performance result in excess of 1.0 PFLOPS, and made Roadrunner the #1 system on the Top500 list in June 2008. We describe the design and implementation of hybrid Linpack, including the special optimizations we developed for this hybrid architecture. We then present actual results for single node and multi-node executions. From this work, we conclude that it is possible to achieve high performance for certain applications on hybrid architectures when careful attention is given to efficient use of memory bandwidth, scheduling of data movement between the host and accelerator memories, and proper distribution of work between the host and accelerator processors.
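
For readers unfamiliar with what a Linpack result measures, the following is a minimal single-process sketch, not the hybrid implementation described in the paper. It solves a dense system by Gaussian elimination with partial pivoting and converts a run time into a floating-point rate using the operation count that the HPL benchmark conventionally credits for an n x n solve, (2/3)n^3 + (3/2)n^2. The function names `hpl_flop_count`, `lu_solve`, and `measured_gflops` are hypothetical and chosen only for this illustration.

```python
def hpl_flop_count(n):
    """Operation count conventionally credited by HPL for an n x n solve."""
    return (2.0 / 3.0) * n**3 + (3.0 / 2.0) * n**2


def lu_solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Illustrative pure-Python version; works on copies of its inputs.
    """
    n = len(a)
    a = [row[:] for row in a]
    b = b[:]
    for k in range(n):
        # Pick the row with the largest entry in column k as the pivot row.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            raise ValueError("matrix is singular")
        a[k], a[p] = a[p], a[k]
        b[k], b[p] = b[p], b[k]
        # Eliminate column k from every row below the pivot.
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= m * a[k][j]
            b[i] -= m * b[k]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x


def measured_gflops(n, seconds):
    """Convert a wall-clock solve time into GFLOPS using the HPL count."""
    return hpl_flop_count(n) / seconds / 1e9
```

On this scale, the 1.0 PFLOPS figure quoted in the abstract corresponds to a `measured_gflops` value above 1,000,000. The abstract attributes that result to memory bandwidth use, data movement scheduling, and work distribution between host and accelerator, none of which this toy version attempts to model.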


