ABSTRACT
The Cell BE processor is a heterogeneous multicore that contains one PowerPC Processor Element (PPE) and eight Synergistic Processor Elements (SPEs). Each SPE has a small software-managed local store. Applications must explicitly control all DMA transfers of code and data between the SPE local stores and main memory, and they must perform any coherence actions required for the transferred data. The need for explicit memory management, together with the limited size of the SPE local stores, makes it challenging to program the Cell BE and achieve high performance. In this paper, we present the design and implementation of our COMIC runtime system and its programming model. COMIC provides the program with the illusion of a globally shared memory, in which the PPE and each of the SPEs can access any shared data item without the programmer having to worry about where the data is or how to obtain it. COMIC is implemented entirely in software with the aid of user-level libraries provided by the Cell SDK. For each read or write operation in SPE code, a COMIC runtime function is inserted to check whether the data is available in that SPE's local store, and to automatically fetch it if it is not. We propose a memory consistency model and a programming model for COMIC, in which the management of synchronization and coherence is centralized in the PPE. To characterize the effectiveness of the COMIC runtime system, we evaluate it with twelve OpenMP benchmark applications on a Cell BE system and on an SMP-like homogeneous multicore (Xeon).
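The check-and-fetch step described in the abstract can be illustrated with a minimal sketch in portable C. This is not COMIC's implementation: the direct-mapped organization, the line size, the write-back policy, and every name below (shared_read, shared_write, lookup, flush_all, miss_count) are assumptions made for illustration, and memcpy stands in for the DMA transfers that real SPE code would issue.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 128u                /* bytes per cached line (assumed) */
#define NUM_LINES 64u                 /* lines in the simulated local store (assumed) */
#define MAIN_MEM_SIZE (64u * 1024u)   /* size of the simulated main memory (assumed) */

static uint8_t main_memory[MAIN_MEM_SIZE];   /* stands in for main memory */

typedef struct {
    int valid;                  /* line holds data */
    int dirty;                  /* line was modified locally */
    uint32_t tag;               /* line-aligned main-memory address */
    uint8_t data[LINE_SIZE];
} cache_line_t;

static cache_line_t local_store[NUM_LINES];  /* stands in for the SPE local store */
static unsigned miss_count;                  /* number of fetches from main memory */

/* Write a dirty line back to main memory; memcpy stands in for a DMA put. */
static void write_back(cache_line_t *line) {
    if (line->valid && line->dirty) {
        memcpy(&main_memory[line->tag], line->data, LINE_SIZE);
        line->dirty = 0;
    }
}

/* Check whether the line holding addr is in the local store; fetch it if not. */
static cache_line_t *lookup(uint32_t addr) {
    assert(addr < MAIN_MEM_SIZE);
    uint32_t tag = addr & ~(LINE_SIZE - 1u);
    cache_line_t *line = &local_store[(tag / LINE_SIZE) % NUM_LINES];
    if (!line->valid || line->tag != tag) {
        write_back(line);                                  /* evict the old line */
        memcpy(line->data, &main_memory[tag], LINE_SIZE);  /* DMA get stand-in */
        line->tag = tag;
        line->valid = 1;
        miss_count++;
    }
    return line;
}

/* Call that a compiler or programmer would insert in place of a shared load. */
uint8_t shared_read(uint32_t addr) {
    return lookup(addr)->data[addr % LINE_SIZE];
}

/* Call that would be inserted in place of a shared store. */
void shared_write(uint32_t addr, uint8_t value) {
    cache_line_t *line = lookup(addr);
    line->data[addr % LINE_SIZE] = value;
    line->dirty = 1;
}

/* Push all local modifications to main memory, e.g. at a synchronization point. */
void flush_all(void) {
    for (unsigned i = 0; i < NUM_LINES; i++) {
        write_back(&local_store[i]);
    }
}
```

In this sketch a write stays in the local copy until the line is evicted or flush_all is called, so other processors would not see it before then. A real system needs a consistency model that defines when such updates become visible, which is the role of the model the abstract proposes.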
CITED BY
Tao Liu, Haibo Lin, Tong Chen, John Kevin O'Brien, Ling Shao, DBDB: optimizing DMATransfer for the cell be architecture, Proceedings of the 23rd international conference on Supercomputing, June 08-12, 2009, Yorktown Heights, NY, USA
INDEX TERMS
Primary Classification:
C. Computer Systems Organization
C.0 GENERAL
Subjects: Hardware/software interfaces
Additional Classification:
D. Software
D.3 PROGRAMMING LANGUAGES
D.3.4 Processors
Subjects: Run-time environments
D.4 OPERATING SYSTEMS
D.4.2 Storage Management
Subjects: Virtual memory
General Terms: Algorithms, Design, Experimentation, Languages, Management, Measurement, Performance
Keywords: Cell BE, OpenMP, heterogeneous multicores, software distributed shared memory, software shared virtual memory