CUBA: an architecture for efficient CPU/co-processor data communication
Source
International Conference on Supercomputing
Proceedings of the 22nd annual international conference on Supercomputing
Island of Kos, Greece
SESSION: Memory management
Pages 299-308
Year of Publication: 2008
ISBN: 978-1-60558-158-3
Authors
Isaac Gelado, Universitat Politecnica de Catalunya, Barcelona, Spain
John H. Kelm, University of Illinois at Urbana-Champaign, Urbana, IL, USA
Shane Ryoo, University of Illinois at Urbana-Champaign, Urbana, IL, USA
Steven S. Lumetta, University of Illinois at Urbana-Champaign, Urbana, IL, USA
Nacho Navarro, Universitat Politecnica de Catalunya, Barcelona, Spain
Wen-mei W. Hwu, University of Illinois at Urbana-Champaign, Urbana, IL, USA
Bibliometrics
Downloads (6 Weeks): 22, Downloads (12 Months): 216, Citation Count: 1
ABSTRACT
Data-parallel co-processors have the potential to improve performance in highly parallel regions of code when coupled to a general-purpose CPU. However, applications often have to be modified in non-intuitive and complicated ways to mitigate the cost of data marshalling between the CPU and the co-processor. In some applications the overheads cannot be amortized and co-processors are unable to provide benefit. The additional effort and complexity of incorporating co-processors make it difficult, if not impossible, to effectively utilize co-processors in large applications. This paper presents CUBA, an architecture model where co-processors encapsulated as function calls can efficiently access their input and output data structures through pointer parameters. The key idea is to map the data structures required by the co-processor to the co-processor local memory as opposed to the CPU's main memory. The mapping in CUBA preserves the original layout of the shared data structures hosted in the co-processor local memory. The mapping renders the data marshalling process unnecessary and reduces the need for code changes to use the co-processors. CUBA allows the CPU to cache hosted data structures with a selective write-through cache policy, allowing the CPU to access hosted data structures while supporting efficient communication with the co-processors. Benchmark simulation results show that a CUBA-based system can approach optimal transfer rates while requiring few changes to the code that executes on the CPU.
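The selective write-through policy mentioned in the abstract can be illustrated with a toy model. The sketch below is not the CUBA hardware design or the authors' simulator: the names `ToyCache`, `hosted_range`, and `coprocessor_sum` are hypothetical, and the cache is modeled as a plain dictionary with no capacity limit or eviction. It only shows the policy distinction, assuming that CPU stores to addresses hosted in co-processor local memory are propagated immediately, while stores to ordinary addresses stay dirty in the cache until an explicit flush.

```python
class ToyCache:
    """Toy CPU cache: write-through for hosted addresses, write-back otherwise."""

    def __init__(self, hosted_range):
        # hosted_range: addresses assumed to live in co-processor local memory
        self.hosted_range = hosted_range
        self.lines = {}          # address -> cached value
        self.dirty = set()       # addresses modified but not yet written back
        self.main_memory = {}    # backing store for ordinary addresses
        self.coproc_memory = {}  # backing store for hosted addresses

    def is_hosted(self, addr):
        return addr in self.hosted_range

    def store(self, addr, value):
        self.lines[addr] = value
        if self.is_hosted(addr):
            # Write-through: co-processor memory is updated immediately
            self.coproc_memory[addr] = value
        else:
            # Write-back: defer the update to main memory
            self.dirty.add(addr)

    def load(self, addr):
        if addr in self.lines:
            return self.lines[addr]
        backing = self.coproc_memory if self.is_hosted(addr) else self.main_memory
        value = backing.get(addr, 0)
        self.lines[addr] = value
        return value

    def flush(self):
        # Write all dirty ordinary lines back to main memory
        for addr in self.dirty:
            self.main_memory[addr] = self.lines[addr]
        self.dirty.clear()


def coprocessor_sum(cache, addrs):
    """Stand-in for a co-processor call that reads only its local memory."""
    return sum(cache.coproc_memory.get(a, 0) for a in addrs)


cache = ToyCache(hosted_range=range(0x1000, 0x1010))
for i in range(4):
    cache.store(0x1000 + i, i + 1)   # hosted: reaches co-processor memory at once
cache.store(0x10, 99)                # ordinary: stays dirty in the cache
result = coprocessor_sum(cache, range(0x1000, 0x1004))
```

In this model the stand-in co-processor call sees the four hosted values without any explicit copy step, while the ordinary store reaches main memory only after `cache.flush()`.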
CITED BY
Bratin Saha, Xiaocheng Zhou, Hu Chen, Ying Gao, Shoumeng Yan, Mohan Rajagopalan, Jesse Fang, Peinan Zhang, Ronny Ronen, Avi Mendelson, Programming model for a heterogeneous x86 platform, ACM SIGPLAN Notices, v.44 n.6, June 2009