Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures
Source
Conference on High Performance Networking and Computing
Proceedings of the 2008 ACM/IEEE conference on Supercomputing - Volume 00
Austin, Texas
Article No. 4
Year of Publication: 2008
ISBN: 978-1-4244-2835-9

Authors

Kaushik Datta
Lawrence Berkeley National Laboratory, Berkeley, CA and University of California at Berkeley, Berkeley, CA

Mark Murphy
University of California at Berkeley, Berkeley, CA

Vasily Volkov
University of California at Berkeley, Berkeley, CA

Samuel Williams
Lawrence Berkeley National Laboratory, Berkeley, CA and University of California at Berkeley, Berkeley, CA

Jonathan Carter
Lawrence Berkeley National Laboratory, Berkeley, CA

Leonid Oliker
Lawrence Berkeley National Laboratory, Berkeley, CA and University of California at Berkeley, Berkeley, CA

David Patterson
Lawrence Berkeley National Laboratory, Berkeley, CA and University of California at Berkeley, Berkeley, CA

John Shalf
Lawrence Berkeley National Laboratory, Berkeley, CA

Katherine Yelick
Lawrence Berkeley National Laboratory, Berkeley, CA and University of California at Berkeley, Berkeley, CA

Publisher
IEEE Press
Piscataway, NJ, USA

ABSTRACT
Understanding the most efficient design and utilization of emerging multicore systems is one of the most challenging questions faced by the mainstream and scientific computing industries in several decades. Our work explores multicore stencil (nearest-neighbor) computations, a class of algorithms at the heart of many structured grid codes, including PDE solvers. We develop a number of effective optimization strategies, and build an auto-tuning environment that searches over our optimizations and their parameters to minimize runtime, while maximizing performance portability. To evaluate the effectiveness of these strategies we explore the broadest set of multicore architectures in the current HPC literature, including the Intel Clovertown, AMD Barcelona, Sun Victoria Falls, IBM QS22 PowerXCell 8i, and NVIDIA GTX280. Overall, our auto-tuning optimization methodology results in the fastest multicore stencil performance to date. Finally, we present several key insights into the architectural tradeoffs of emerging multicore designs and their implications for scientific algorithm development.
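
As a rough illustration of the two ideas named in the abstract, the sketch below pairs a 7-point nearest-neighbor stencil sweep with a tiny search over cache-block sizes that keeps the fastest candidate. It is not the authors' framework: the function names (`make_grid`, `sweep_blocked`, `autotune`), the coefficients, and the candidate list are all assumptions chosen for illustration, and pure Python will not reproduce the cache behavior that makes blocking pay off in compiled code.

```python
import time


def make_grid(n, fill=0.0):
    """Allocate an n x n x n grid as nested lists, indexed grid[i][j][k]."""
    return [[[fill] * n for _ in range(n)] for _ in range(n)]


def sweep_blocked(src, dst, n, bi, bj, alpha=-6.0, beta=1.0):
    """One 7-point stencil sweep over interior points, blocked in i and j.

    Each interior point becomes alpha * center + beta * (sum of 6 neighbors).
    The block sizes bi and bj change only the traversal order, not the result.
    """
    for i0 in range(1, n - 1, bi):
        for j0 in range(1, n - 1, bj):
            for i in range(i0, min(i0 + bi, n - 1)):
                for j in range(j0, min(j0 + bj, n - 1)):
                    row, out = src[i][j], dst[i][j]
                    up, down = src[i - 1][j], src[i + 1][j]
                    north, south = src[i][j - 1], src[i][j + 1]
                    for k in range(1, n - 1):  # unit-stride inner loop
                        out[k] = alpha * row[k] + beta * (
                            row[k - 1] + row[k + 1]
                            + up[k] + down[k] + north[k] + south[k]
                        )


def autotune(n, candidates, trials=2):
    """Time each (bi, bj) candidate and return (best_candidate, best_seconds).

    Hypothetical exhaustive search: a real auto-tuner would explore many more
    optimizations and parameters than two block sizes.
    """
    src, dst = make_grid(n, 1.0), make_grid(n)
    best = None
    for bi, bj in candidates:
        elapsed = float("inf")
        for _ in range(trials):
            start = time.perf_counter()
            sweep_blocked(src, dst, n, bi, bj)
            elapsed = min(elapsed, time.perf_counter() - start)
        if best is None or elapsed < best[1]:
            best = ((bi, bj), elapsed)
    return best


if __name__ == "__main__":
    choice, seconds = autotune(16, [(2, 2), (4, 4), (8, 8), (16, 16)])
    print("fastest block size:", choice, "time (s):", seconds)
```

The search is empirical on purpose: it measures each variant on the machine at hand instead of predicting which one should win, which is what lets the same tuner be reused across different architectures.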