ABSTRACT
A trend is developing in high-performance computing in which commodity processors are coupled to various types of computational accelerators. Such systems are commonly called hybrid systems. In this paper, we describe our experience developing an implementation of the Linpack benchmark for a petascale hybrid system, the Roadrunner cluster built by IBM for Los Alamos National Laboratory (LANL). This system combines traditional x86-64 host processors with IBM PowerXCell™ 8i accelerator processors. The implementation of Linpack we developed was the first to achieve a performance result in excess of 1.0 PFLOPS, and it made Roadrunner the #1 system on the Top500 list in June 2008. We describe the design and implementation of hybrid Linpack, including the special optimizations we developed for this hybrid architecture. We then present measured results for single-node and multi-node executions. From this work, we conclude that it is possible to achieve high performance for certain applications on hybrid architectures when careful attention is given to efficient use of memory bandwidth, scheduling of data movement between the host and accelerator memories, and proper distribution of work between the host and accelerator processors.