|
ABSTRACT
In heterogeneous multi-core systems, such as the Cell BE or certain embedded systems, the accelerator core has its own fast local memory without hardware supported coherence. It is software's responsibility to dynamically transfer the working set when the total data set is too large to fit in the local memory. The data can be transferred through a software controlled cache which maintains correctness and exploits reuse among references, especially when complicated aliasing or data dependence exists. However, the software controlled cache introduces the extra overhead of cache lookup. In this paper we present the design and implementation of a Direct Blocking Data Buffer (DBDB) which combines compiler analysis and runtime management to optimize local memory utilization. We use compile time analysis to identify regular references in a loop body, block the innermost loop according to the access patterns and available local memory space, insert DMA operations for the blocked loop, and substitute references to local buffers. The runtime is responsible for allocating local memory for DBDB, especially for disambiguating aliased memory accesses which could not be resolved at compile time. We further optimize noncontiguous references by taking advantage of the DMA-list feature provided by the Cell BE. A practical performance model is presented to guide the DMA transfer scheme selection among single-DMA, multi-DMA and DMA-list. We have implemented DBDB in the IBM XL C/C++ for Multicore Acceleration for Linux, and have conducted experiments with selected test cases from the NAS OpenMP and SPEC benchmarks. The results show that our method performs well compared with traditional software cache approach. We have observed a speedup of up to 5.3x and 4x in average.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
B. Flachs, S. Asano, S. H. Dhong et al., "The microarchitecture of the synergistic processor for a Cell processor," IEEE Journal of Solid-State Circuits, vol. 41, no. 1, pp. 63--70, 2006.
|
| |
2
|
D. Pham, S. Asano, M. Bolliger et al., "The design and implementation of a first-generation Cell processor," in Proc. of IEEE International Solid-State Circuits Conference, 2005, pp. 184--592 Vol. 1.
|
 |
3
|
Rajeshwari Banakar , Stefan Steinke , Bo-Sik Lee , M. Balakrishnan , Peter Marwedel, Scratchpad memory: design alternative for cache on-chip memory in embedded systems, Proceedings of the tenth international symposium on Hardware/software codesign, May 06-08, 2002, Estes Park, Colorado
[doi> 10.1145/774789.774805]
|
 |
4
|
|
| |
5
|
A. E. Eichenberger , J. K. O'Brien , K. M. O'Brien , P. Wu , T. Chen , P. H. Oden , D. A. Prener , J. C. Shepherd , B. So , Z. Sura , A. Wang , T. Zhang , P. Zhao , M. K. Gschwind , R. Archambault , Y. Gao , R. Koo, Using advanced compiler technology to exploit the performance of the Cell Broadband EngineTM architecture, IBM Systems Journal, v.45 n.1, p.59-84, January 2006
|
 |
6
|
|
 |
7
|
|
| |
8
|
H. Jin, M. Frumkin, and J. Yan, "The OpenMP implementation of NAS parallel benchmarks and its performance." NAS Technical Report NAS-99-011, NASA Ames Research Center, Moffett Field, CA, October, 1999.
|
| |
9
|
|
| |
10
|
|
| |
11
|
U. J. Kapasi, P. Mattson, W. J. Dally et al., "Stream scheduling," in Proc. of the 3rd Workshop on Media and Streaming Processors, 2001, pp. 101--106.
|
 |
12
|
Scott Schneider , Jae-Seung Yeom , Benjamin Rose , John C. Linford , Adrian Sandu , Dimitrios S. Nikolopoulos, A comparison of programming models for multiprocessors with explicitly managed memory hierarchies, Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming, February 14-18, 2009, Raleigh, NC, USA
|
 |
13
|
Kayvon Fatahalian , Daniel Reiter Horn , Timothy J. Knight , Larkhoon Leem , Mike Houston , Ji Young Park , Mattan Erez , Manman Ren , Alex Aiken , William J. Dally , Pat Hanrahan, Sequoia: programming the memory hierarchy, Proceedings of the 2006 ACM/IEEE conference on Supercomputing, November 11-17, 2006, Tampa, Florida
[doi> 10.1145/1188455.1188543]
|
 |
14
|
Timothy J. Knight , Ji Young Park , Manman Ren , Mike Houston , Mattan Erez , Kayvon Fatahalian , Alex Aiken , William J. Dally , Pat Hanrahan, Compilation for explicitly managed memory hierarchies, Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming, March 14-17, 2007, San Jose, California, USA
[doi> 10.1145/1229428.1229477]
|
| |
15
|
|
 |
16
|
Tong Chen , Tao Zhang , Zehra Sura , Mar Gonzales Tallada, Prefetching irregular references for software cache on cell, Proceedings of the sixth annual IEEE/ACM international symposium on Code generation and optimization, April 05-09, 2008, Boston, MA, USA
[doi> 10.1145/1356058.1356079]
|
| |
17
|
|
 |
18
|
|
 |
19
|
Ioannis Schoinas , Babak Falsafi , Alvin R. Lebeck , Steven K. Reinhardt , James R. Larus , David A. Wood, Fine-grain access control for distributed shared memory, Proceedings of the sixth international conference on Architectural support for programming languages and operating systems, p.297-306, October 05-07, 1994, San Jose, California, United States
|
| |
20
|
|
| |
21
|
Osman S. Unsal , Raksit Ashok , Israel Koren , C. Mani Krishna , Csaba Andras Moritz, Cool-cache for hot multimedia, Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture, December 01-05, 2001, Austin, Texas
|
| |
22
|
J. Balart, M. Gonzalez, X. Martorell et al., "A novel asynchronous software cache implementation for the cell-be processor," in Proc. of the 20th International Workshop on Languages and Compilers for Parallel Computing (LCPC'07), 2007.
|
 |
23
|
Jaejin Lee , Sangmin Seo , Chihun Kim , Junghyun Kim , Posung Chun , Zehra Sura , Jungwon Kim , SangYong Han, COMIC: a coherent shared memory interface for cell be, Proceedings of the 17th international conference on Parallel architectures and compilation techniques, October 25-29, 2008, Toronto, Ontario, Canada
[doi> 10.1145/1454115.1454157]
|
| |
24
|
S. Seo, J. Lee, and Z. Sura, "Design and implementation of software-managed caches for multicores with local memory," in Proc. of the 15th International Symposium on High-Performance Computer Architecture (HPCA'09), 2009.
|
 |
25
|
Marc Gonzàlez , Nikola Vujic , Xavier Martorell , Eduard Ayguadé , Alexandre E. Eichenberger , Tong Chen , Zehra Sura , Tao Zhang , Kevin O'Brien , Kathryn O'Brien, Hybrid access-specific software cache techniques for the cell BE architecture, Proceedings of the 17th international conference on Parallel architectures and compilation techniques, October 25-29, 2008, Toronto, Ontario, Canada
[doi> 10.1145/1454115.1454156]
|
| |
26
|
|
| |
27
|
T. Chen, Z. Sura, K. O'Brien, and K. O'Brien, "Optimizing the use of static buffers for DMA on a CELL chip," in Proc. of the International Workshop on Languages and Compilers for Parallel Computing (LCPC'06). Springer Berlin, 2006, pp. 314--329.
|
|