ABSTRACT
The slowing pace of commodity microprocessor performance improvements, combined with ever-increasing chip power demands, has become a matter of utmost concern to computational scientists. As a result, the high-performance computing community is examining alternative architectures that address the limitations of modern cache-based designs. In this work, we examine the potential of using the forthcoming STI Cell processor as a building block for future high-end computing systems. Our work contains several novel contributions. First, we introduce a performance model for Cell and apply it to several key scientific computing kernels: dense matrix multiply, sparse matrix-vector multiply, stencil computations, and 1D/2D FFTs. The difficulty of programming Cell, which requires assembly-level intrinsics for the best performance, makes this model useful as an initial step in algorithm design and evaluation. Next, we validate the accuracy of our model by comparing its predictions against published hardware results, as well as against our own implementations on the Cell full-system simulator. Additionally, we compare Cell performance to benchmarks run on leading superscalar (AMD Opteron), VLIW (Intel Itanium2), and vector (Cray X1E) architectures. Our work also explores several different mappings of the kernels and demonstrates a simple and effective programming model for Cell's unique architecture. Finally, we propose modest microarchitectural modifications that could significantly increase the efficiency of double-precision calculations. Overall, our results demonstrate the tremendous potential of the Cell architecture for scientific computations in terms of both raw performance and power efficiency.