ABSTRACT
Future mainstream microprocessors will likely integrate specialized accelerators, such as GPUs, onto a single die to achieve better performance and power efficiency. However, programming such a heterogeneous multicore platform remains a significant challenge, since these specialized accelerators feature ISAs and functionality that differ significantly from those of the general-purpose CPU cores. In this paper, we present EXOCHI: (1) Exoskeleton Sequencer (EXO), an architecture that represents heterogeneous accelerators as ISA-based MIMD architecture resources, together with a shared virtual memory heterogeneous multithreaded program execution model that tightly couples specialized accelerator cores with general-purpose CPU cores, and (2) C for Heterogeneous Integration (CHI), an integrated C/C++ programming environment that supports accelerator-specific inline assembly and domain-specific languages. The CHI compiler extends the OpenMP pragma for heterogeneous multithreaded programming and produces a single fat binary with code sections corresponding to different instruction sets. The runtime can judiciously spread parallel computation across the heterogeneous cores to optimize performance and power. We have prototyped the EXO architecture on a physical heterogeneous platform consisting of an Intel® Core™ 2 Duo processor and an 8-core, 32-thread Intel® Graphics Media Accelerator X3000. In addition, we have implemented the CHI integrated programming environment with the Intel® C++ Compiler, runtime toolset, and debugger. On the EXO prototype system, we have enhanced a suite of production-quality media kernels for video and image processing to utilize the accelerator through the CHI programming interface, achieving significant speedup (1.41X to 10.97X) over execution on the IA32 CPU alone.