|
ABSTRACT
The recent emergence of compute-intensive stream processors such as the Cell Broadband Engine, Stanford's Merrimac, and Clear-Speed's CSX600 has made them attractive platforms for scientific high-performance computing. Unstructured mesh and graph applications are an important class of numerical algorithms used in the scientific computing domain, which are particularly challenging for stream architectures. These codes have irregular structures where nodes have a variable number of neighbors, resulting in irregular memory access patterns and irregular control. We study four representative sub-classes of irregular algorithms, including finite-element and finite-volume methods for modeling physical systems, direct methods for n-body problems, and computations involving sparse algebra. We propose a framework for representing the diverse characteristics of these algorithms in the context of the unique properties of stream architectures, and demonstrate it using one representative application from each sub-class. We then develop techniques for mapping the applications onto a stream processor, placing emphasis on data-localization and parallelizations. Our simulations show that efficient stream hardware with restricted control abilities can effectively run challenging irregular applications with, for example, a finite element method and a molecular dynamic code sustaining 69GFLOP/s and 46GFLOP/s (64-bit) respectively using a single chip that measures 12mm on a side and consumes less than 70W in 90nm technology.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
Jung Ho Ahn , Mattan Erez , William J. Dally, Tradeoff between data-, instruction-, and thread-level parallelism in stream processors, Proceedings of the 21st annual international conference on Supercomputing, June 17-21, 2007, Seattle, Washington
[doi> 10.1145/1274971.1274991]
|
| |
2
|
|
 |
3
|
|
 |
4
|
J. R. Allen , Ken Kennedy , Carrie Porterfield , Joe Warren, Conversion of control dependence to data dependence, Proceedings of the 10th ACM SIGACT-SIGPLAN symposium on Principles of programming languages, p.177-189, January 24-26, 1983, Austin, Texas
[doi> 10.1145/567067.567085]
|
| |
5
|
T. Barth. Simplified discontinuous Galerkin methods for systems of conservation laws with convex extension. In Cockburn, Karniadakis, and Shu, editors, Discontinuous Galerkin Methods, volume 11 of Lecture Notes in Computational Science and Engineering. Springer-Verlag, Heidelberg, 1999.
|
| |
6
|
|
| |
7
|
|
| |
8
|
W. W. Carlson, J. M. Draper, D. E. Culler, K. Yelick, E. Brooks, and K. Warren. Introduction to UPC and language specification. University of California-Berkeley Technical Report: CCS-TR-99-157, 1999.
|
| |
9
|
John B. Carter , Wilson C. Hsieh , Leigh B. Stoller , Mark Swanson , Lixin Zhang , Sally A. McKee, Impulse: Memory system support for scientific applications, Scientific Programming, v.7 n.3-4, p.195-209, August 1999
|
| |
10
|
ClearSpeed. CSX600 Datasheet. http://www.clearspeed.com/downloads/CSX600Processor.pdf, 2005.
|
| |
11
|
|
| |
12
|
William J. Dally , Francois Labonte , Abhishek Das , Patrick Hanrahan , Jung-Ho Ahn , Jayanth Gummaraju , Mattan Erez , Nuwan Jayasena , Ian Buck , Timothy J. Knight , Ujval J. Kapasi, Merrimac: Supercomputing with Streams, Proceedings of the 2003 ACM/IEEE conference on Supercomputing, p.35, November 15-21, 2003
|
| |
13
|
|
| |
14
|
E. F. D'Azevedo, M. R. Fahey, and R. T. Mills. Vectorized sparse matrix multiply for compressed row storage format. In proceedings of the 2005 International Conference on Computational Science (ICCS'05), pages 99--106, May 2005.
|
 |
15
|
Steven J. Deitz , Bradford L. Chamberlain , Sung-Eun Choi , Lawrence Snyder, The design and implementation of a parallel array operator for the arbitrary remapping of data, Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming, June 11-13, 2003, San Diego, California, USA
|
| |
16
|
S. J. Deitz, B. L. Chamberlain, and L. Snyder. Abstractions for dynamic data distribution. In Ninth International Workshop on High-Level Parallel Programming Models and Supportive Environments, pages 42--51. IEEE Computer Society, 2004.
|
 |
17
|
|
| |
18
|
N. Goharian, T. El-Ghazawi, D. Grossman, and A. Chowdhury. On the enhancements of a sparse matrix information retrieval approach. PDPTA, 2000.
|
| |
19
|
M. Guo. Automatic parallelization and optimization for irregular scientic applications. In 18th International Parallel and Distributed Processing Symposium, 2005.
|
| |
20
|
H. Han, G. Rivera, and C. Tseng. Software support for improving locality in scientic codes. In Compilers for Parallel Computation, 2000.
|
| |
21
|
|
| |
22
|
|
| |
23
|
Yuan-Shin Hwang , Bongki Moon , Shamik D. Sharma , Ravi Ponnusamy , Raja Das , Joel H. Saltz, Runtime and language support for compiling adaptive irregular programs on distributed-memory machines, Software—Practice & Experience, v.25 n.6, p.597-621, June 1995
[doi> 10.1002/spe.4380250603]
|
| |
24
|
|
 |
25
|
Ujval J. Kapasi , William J. Dally , Scott Rixner , Peter R. Mattson , John D. Owens , Brucek Khailany, Efficient conditional operations for data-parallel architectures, Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture, p.159-170, December 2000, Monterey, California, United States
[doi> 10.1145/360128.360145]
|
| |
26
|
Ujval J. Kapasi , Scott Rixner , William J. Dally , Brucek Khailany , Jung Ho Ahn , Peter Mattson , John D. Owens, Programmable Stream Processors, Computer, v.36 n.8, p.54-62, August 2003
[doi> 10.1109/MC.2003.1220582]
|
| |
27
|
|
| |
28
|
K. Kitagawa, S. Tagaya, Y. Hagihara, and Y. Kanoh. A hardware overview of SX-6 and SX-7 supercomputer. NEC Research and Development, 44(1):27, January 2003.
|
| |
29
|
|
 |
30
|
|
 |
31
|
|
 |
32
|
|
 |
33
|
|
 |
34
|
|
| |
35
|
D. I. Pullin and D. J. Hill. Computational methods for shock-driven turbulence and les of the richtmyer-meshkov instability. USNCCM, 2003.
|
| |
36
|
D. Roccatano, R. Bizzarri, G. Chillemi, N. Sanna, and A. D. Nola. Development of a parallel molecular dynamics code on SIMD computers: Algorithm for use of pair list criterion. Journal of Computational Chemistry, 19(7):685--694, 1998.
|
 |
37
|
|
| |
38
|
|
| |
39
|
|
| |
40
|
P. Sanders. Efficient emulation of MIMD behavior on SIMD machines. Technical Report iratr-1995-29, 1995.
|
| |
41
|
|
 |
42
|
|
| |
43
|
T. Sterling, D. Savarese, D. J. Becker, J. E. Dorband, U. A. Ranawake, and C. V. Packer. BEOWULF: A parallel workstation for scientific computation. In Proceedings of the 24th International Conference on Parallel Processing, pages I:11--14, 1995.
|
| |
44
|
|
| |
45
|
Michael Bedford Taylor , Jason Kim , Jason Miller , David Wentzlaff , Fae Ghodrat , Ben Greenwald , Henry Hoffman , Paul Johnson , Jae-Wook Lee , Walter Lee , Albert Ma , Arvind Saraf , Mark Seneski , Nathan Shnidman , Volker Strumpen , Matt Frank , Saman Amarasinghe , Anant Agarwal, The Raw Microprocessor: A Computational Fabric for Software Circuits and General-Purpose Programs, IEEE Micro, v.22 n.2, p.25-35, March 2002
[doi> 10.1109/MM.2002.997877]
|
| |
46
|
D. van der Spoel, A. R. van Buuren, E. Apol, P. J. Meulenhoff, D. P. Tieleman, A. L. T. M. Sijbers, B. Hess, K. A. Feenstra, E. Lindahl, R. van Drunen, and H. J. C. Berendsen. Gromacs User Manual version 3.1. Nijenborgh 4, 9747 AG Groningen, The Netherlands. Internet: http://www.gromacs.org, 2001.
|
 |
47
|
Samuel Williams , John Shalf , Leonid Oliker , Shoaib Kamil , Parry Husbands , Katherine Yelick, The potential of the cell processor for scientific computing, Proceedings of the 3rd conference on Computing frontiers, May 03-05, 2006, Ischia, Italy
[doi> 10.1145/1128022.1128027]
|
| |
48
|
K. Yelick, L. Semenzato, G. Pike, C. Miyamoto, B. Liblit, A. Krishnamurthy, P. Hilfinger, S. Graham, D. Gay, P. Colella, and A. Aiken. Titanium: A high-performance Java dialect. In ACM 1998 Workshop on Java for High-Performance Network Computing, Stanford, California, 1998.
|
CITED BY 3
|
|
|
|
|
Jacob Leverich , Hideho Arakida , Alex Solomatnikov , Amin Firoozshahian , Mark Horowitz , Christos Kozyrakis, Comparative evaluation of memory models for chip multiprocessors, ACM Transactions on Architecture and Code Optimization (TACO), v.5 n.3, p.1-30, November 2008
|
|
|
|
|