An efficient in-place 3D transpose for multicore processors with software managed memory hierarchy
Source ACM International Conference Proceeding Series; Vol. 356
Proceedings of the 1st international forum on Next-generation multicore/manycore technologies
Cairo, Egypt
SESSION: Memory hierarchy
Article No. 10  
Year of Publication: 2008
ISBN: 978-1-60558-407-2
Authors
Ali El-Moursy  Electronics Research Institute, Giza, Egypt
Ahmed El-Mahdy  Alexandria University, Alexandria, Egypt
Hisham El-Shishiny  IBM Centre for Advanced Studies in Cairo, IBM WTC, El-Ahram, Giza, Egypt
Sponsors
IBM
IBM Center for Advanced Studies, Cairo, Egypt
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1463768.1463781

ABSTRACT

3D transpose is an important operation in many large-scale scientific applications, such as seismic and medical imaging. This paper proposes a novel algorithm for fast in-place 3D transposition. The algorithm exploits Single Instruction Multiple Data (SIMD) multicore architectures with a software-managed memory hierarchy. Such architectural features are present in next-generation processors such as the Cell Broadband Engine (Cell BE). The algorithm performs transposition at two levels of granularity: at a coarse-grain level, where logical transposition is done by merely transposing the address map at each access; and at a fine-grain level, where physical transposition is done by actually displacing or swapping elements. This combination allows large transfer sizes, which make good use of on-chip bandwidth, while still permitting fine-grain SIMD operations. The transfer rate is further improved by batch-transposing data that are spatially adjacent along a major axis. Results on the Cell BE processor show substantial utilisation of on-chip communication bandwidth and negligible processing time.
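
The two levels of granularity described in the abstract can be illustrated with a small sketch. This is not the authors' Cell BE implementation: it uses plain Python with no SIMD or DMA transfers, and the class name `BlockedCube`, the tile-major storage layout, the cubic shape, and the choice to swap the first and last axes are all assumptions made for illustration. Elements are swapped physically only inside each tile, while the tiles themselves are never moved; their coordinates are swapped in the address map on each access.

```python
class BlockedCube:
    """Hypothetical n x n x n array stored as contiguous b x b x b tiles."""

    def __init__(self, n, b):
        if n % b != 0:
            raise ValueError("tile size b must divide n")
        self.n, self.b, self.nb = n, b, n // b
        self.data = [0] * (n * n * n)   # single flat buffer, never reallocated
        self.swapped = False            # state of the coarse-level address map

    def _index(self, i, j, k):
        """Map logical coordinates to a position in the flat buffer."""
        b, nb = self.b, self.nb
        bi, oi = divmod(i, b)           # tile coordinate, offset inside tile
        bj, oj = divmod(j, b)
        bk, ok = divmod(k, b)
        if self.swapped:
            # Coarse level: logical transposition of the tile address map.
            bi, bk = bk, bi
        tile = (bi * nb + bj) * nb + bk
        offset = (oi * b + oj) * b + ok
        return tile * b ** 3 + offset

    def get(self, i, j, k):
        return self.data[self._index(i, j, k)]

    def set(self, i, j, k, value):
        self.data[self._index(i, j, k)] = value

    def transpose_first_last(self):
        """Swap the first and last axes in place."""
        b = self.b
        tile_size = b ** 3
        # Fine level: physical element swaps inside every tile.
        for base in range(0, len(self.data), tile_size):
            for oi in range(b):
                for oj in range(b):
                    for ok in range(oi + 1, b):
                        p = base + (oi * b + oj) * b + ok
                        q = base + (ok * b + oj) * b + oi
                        self.data[p], self.data[q] = self.data[q], self.data[p]
        # Coarse level: tiles stay where they are; only the map flips.
        self.swapped = not self.swapped
```

After `transpose_first_last()`, reading `get(i, j, k)` returns the value originally stored at `(k, j, i)`, and the buffer is the same list object throughout. In a real software-managed memory hierarchy, each tile would presumably be the unit of transfer into local store, which is why the sketch keeps tiles contiguous.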


