Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures
Source: Conference on High Performance Networking and Computing
Proceedings of the 2008 ACM/IEEE conference on Supercomputing - Volume 00
Austin, Texas
SECTION: Papers
Article No. 4  
Year of Publication: 2008
ISBN:978-1-4244-2835-9
Authors
Kaushik Datta  Lawrence Berkeley National Laboratory, Berkeley, CA and University of California at Berkeley, Berkeley, CA
Mark Murphy  University of California at Berkeley, Berkeley, CA
Vasily Volkov  University of California at Berkeley, Berkeley, CA
Samuel Williams  Lawrence Berkeley National Laboratory, Berkeley, CA and University of California at Berkeley, Berkeley, CA
Jonathan Carter  Lawrence Berkeley National Laboratory, Berkeley, CA
Leonid Oliker  Lawrence Berkeley National Laboratory, Berkeley, CA and University of California at Berkeley, Berkeley, CA
David Patterson  Lawrence Berkeley National Laboratory, Berkeley, CA and University of California at Berkeley, Berkeley, CA
John Shalf  Lawrence Berkeley National Laboratory, Berkeley, CA
Katherine Yelick  Lawrence Berkeley National Laboratory, Berkeley, CA and University of California at Berkeley, Berkeley, CA
Publisher
IEEE Press  Piscataway, NJ, USA
Bibliometrics
Downloads (6 Weeks): 57,   Downloads (12 Months): 582,   Citation Count: 5

ABSTRACT

Understanding the most efficient design and utilization of emerging multicore systems is one of the most challenging questions faced by the mainstream and scientific computing industries in several decades. Our work explores multicore stencil (nearest-neighbor) computations, a class of algorithms at the heart of many structured grid codes, including PDE solvers. We develop a number of effective optimization strategies and build an auto-tuning environment that searches over our optimizations and their parameters to minimize runtime while maximizing performance portability. To evaluate the effectiveness of these strategies, we explore the broadest set of multicore architectures in the current HPC literature, including the Intel Clovertown, AMD Barcelona, Sun Victoria Falls, IBM QS22 PowerXCell 8i, and NVIDIA GTX280. Overall, our auto-tuning optimization methodology results in the fastest multicore stencil performance to date. Finally, we present several key insights into the architectural tradeoffs of emerging multicore designs and their implications for scientific algorithm development.
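As a rough illustration of what a stencil sweep and a search over optimization parameters can look like, the sketch below applies a 7-point nearest-neighbor update to a 3D grid in tiles and times several tile shapes to pick the fastest. It is not the authors' tuning environment: the function names (`make_grid`, `stencil_blocked`, `autotune`), the coefficients `ALPHA` and `BETA`, the choice of loop blocking as the single tuned optimization, and the candidate tile shapes are all assumptions made for illustration.

```python
import time

# Hypothetical stencil weights (assumption); ALPHA + 6 * BETA == 1.0 exactly.
ALPHA = 0.25
BETA = 0.125


def make_grid(n):
    """Return an n*n*n grid as a flat list with a deterministic fill."""
    return [float((i * 7 + 3) % 11) for i in range(n * n * n)]


def stencil_blocked(src, n, block):
    """One out-of-place 7-point sweep over the interior, traversed in tiles.

    block = (bi, bj, bk) gives the tile extents; boundary cells are copied.
    """
    bi, bj, bk = block
    dst = list(src)
    for ii in range(1, n - 1, bi):
        for jj in range(1, n - 1, bj):
            for kk in range(1, n - 1, bk):
                # Sweep one tile, clipped to the grid interior.
                for i in range(ii, min(ii + bi, n - 1)):
                    for j in range(jj, min(jj + bj, n - 1)):
                        for k in range(kk, min(kk + bk, n - 1)):
                            c = (i * n + j) * n + k
                            dst[c] = ALPHA * src[c] + BETA * (
                                src[c - 1] + src[c + 1]
                                + src[c - n] + src[c + n]
                                + src[c - n * n] + src[c + n * n])
    return dst


def autotune(src, n, candidates, repeats=3):
    """Time each candidate tile shape; return (fastest_shape, timings)."""
    timings = {}
    for block in candidates:
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            stencil_blocked(src, n, block)
            best = min(best, time.perf_counter() - t0)
        timings[block] = best
    return min(timings, key=timings.get), timings


if __name__ == "__main__":
    n = 16
    grid = make_grid(n)
    shapes = [(n, n, n), (4, 4, n), (8, 8, 8), (2, 2, 2)]
    best_shape, times = autotune(grid, n, shapes)
    print("fastest tile shape on this machine:", best_shape)
```

Because every tile shape computes each point from the same source values in the same order, all candidates produce identical results, so the search is free to choose on runtime alone. In interpreted Python the timing differences mostly reflect loop overhead rather than cache behavior; a compiled kernel would be needed to see the memory-hierarchy effects the abstract refers to.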



