DBDB: Optimizing DMA Transfer for the Cell BE Architecture
Source
International Conference on Supercomputing
Proceedings of the 23rd International Conference on Supercomputing
Yorktown Heights, NY, USA
SESSION: Applications of the Cell Processor
Pages 36-45
Year of Publication: 2009
ISBN: 978-1-60558-498-0
Authors
Tao Liu  IBM China Research Laboratory, Beijing, China
Haibo Lin  IBM China Research Laboratory, Beijing, China
Tong Chen  IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA
John Kevin O'Brien  IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA
Ling Shao  IBM China Research Laboratory, Beijing, China
Sponsors
ACM: Association for Computing Machinery
SIGARCH: ACM Special Interest Group on Computer Architecture
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1542275.1542286

ABSTRACT

In heterogeneous multi-core systems, such as the Cell BE or certain embedded systems, the accelerator core has its own fast local memory without hardware-supported coherence. Software is responsible for dynamically transferring the working set when the total data set is too large to fit in the local memory. The data can be transferred through a software-controlled cache, which maintains correctness and exploits reuse among references, especially when complicated aliasing or data dependences exist. However, a software-controlled cache introduces the extra overhead of cache lookup. In this paper we present the design and implementation of a Direct Blocking Data Buffer (DBDB), which combines compiler analysis and runtime management to optimize local memory utilization. We use compile-time analysis to identify regular references in a loop body, block the innermost loop according to the access patterns and available local memory space, insert DMA operations for the blocked loop, and substitute references to local buffers. The runtime is responsible for allocating local memory for DBDB and, in particular, for disambiguating aliased memory accesses that could not be resolved at compile time. We further optimize noncontiguous references by taking advantage of the DMA-list feature provided by the Cell BE. A practical performance model is presented to guide the selection of a DMA transfer scheme among single-DMA, multi-DMA, and DMA-list. We have implemented DBDB in IBM XL C/C++ for Multicore Acceleration for Linux and have conducted experiments with selected test cases from the NAS OpenMP and SPEC benchmarks. The results show that our method performs well compared with a traditional software cache approach. We observed speedups of up to 5.3x, and 4x on average.
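The loop transformation described in the abstract (block the innermost loop, insert DMA operations around each block, and substitute references to local buffers) can be illustrated with a minimal sketch. This is not the paper's implementation: the names `BLOCK_ELEMS`, `dma_get`, `dma_put`, `scale_add_reference`, and `scale_add_blocked` are hypothetical, and `memcpy` stands in for the DMA transfers that real Cell BE hardware would perform asynchronously.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical block size in elements. A real compiler/runtime would
   derive it from the access pattern and the free local memory. */
#define BLOCK_ELEMS 256

/* Stand-ins for DMA transfers between global memory and local buffers.
   Assumption: memcpy only models the data movement, not its latency. */
static void dma_get(float *local, const float *global, size_t n)
{
    memcpy(local, global, n * sizeof *local);
}

static void dma_put(float *global, const float *local, size_t n)
{
    memcpy(global, local, n * sizeof *global);
}

/* Original loop: every reference goes directly to global memory. */
void scale_add_reference(float *c, const float *a, const float *b,
                         float s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] * s + b[i];
}

/* Blocked loop: each block is fetched into local buffers, computed on
   locally, and written back. Assumes c does not alias a or b. */
void scale_add_blocked(float *c, const float *a, const float *b,
                       float s, size_t n)
{
    float buf_a[BLOCK_ELEMS], buf_b[BLOCK_ELEMS], buf_c[BLOCK_ELEMS];

    assert(a != NULL && b != NULL && c != NULL);

    for (size_t base = 0; base < n; base += BLOCK_ELEMS) {
        /* The last block may be shorter than BLOCK_ELEMS. */
        size_t len = (n - base < BLOCK_ELEMS) ? n - base : BLOCK_ELEMS;

        dma_get(buf_a, a + base, len);      /* inserted DMA "get" */
        dma_get(buf_b, b + base, len);

        for (size_t i = 0; i < len; i++)    /* references now hit local buffers */
            buf_c[i] = buf_a[i] * s + buf_b[i];

        dma_put(c + base, buf_c, len);      /* inserted DMA "put" */
    }
}
```

Calling `scale_add_blocked(c, a, b, s, n)` should produce the same output array as `scale_add_reference` for non-aliased inputs. The sketch deliberately omits the parts the abstract assigns to the runtime and the performance model, such as alias disambiguation and choosing among single-DMA, multi-DMA, and DMA-list transfers.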

