A control-structure splitting optimization for GPGPU
Source
Conference On Computing Frontiers
Proceedings of the 6th ACM conference on Computing frontiers
Ischia, Italy
SESSION: Innovative acceleration platforms
Pages 147-150
Year of Publication: 2009
ISBN: 978-1-60558-413-3
Authors
Snaider Carrillo  University of Delaware, Newark, USA
Jakob Siegel  University of Delaware, Newark, USA
Xiaoming Li  University of Delaware, Newark, USA
Sponsors
ACM: Association for Computing Machinery
SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1531743.1531766

ABSTRACT

Control statements in a GPU program, such as loops and branches, pose serious challenges for the efficient use of GPU resources because they lead to the serialization of threads and consequently degrade the occupancy of the GPU, that is, the number of threads running concurrently. Unlike traditional vector processing units inside a general-purpose processor, the GPU cannot leave control statements to the CPU because fine-grained statement scheduling between GPU and CPU is impossible. We need an effective method to handle control statements "just in place" on the GPU.

In this paper, we propose novel techniques to transform control statements so that they can be executed efficiently on GPUs. Our techniques deliberately increase code redundancy, which might be deemed a "de-optimization" on a CPU, to improve the occupancy of a program on the GPU and therefore its performance. We focus on how common programming structures such as loops and branches decrease the occupancy of single kernels and how to counter that. We demonstrate our optimizations on a synthetic benchmark and on a complex parallel algorithm, the Lattice Boltzmann Method (LBM). Our results show that these techniques are effective and can lead to an increase in occupancy and a drastic improvement in performance compared to the non-split versions of the programs.
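
The abstract does not spell out the transformation itself, so the following is only a minimal sketch of the general idea of splitting a branch into separate branch-free kernels. It is emulated on the CPU in plain C++ so that it runs without a GPU. The names `heavy_path`, `light_path`, `run_unsplit`, and `run_split` are hypothetical, the loop index stands in for a GPU thread index, and the paper's actual technique may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the two sides of a branch inside a kernel.
static float heavy_path(float x) { return x * x + 1.0f; }
static float light_path(float x) { return -x; }

// Unsplit form: a single "kernel" body that branches per element.
// The loop index i plays the role of a GPU thread index.
std::vector<float> run_unsplit(const std::vector<float>& in,
                               const std::vector<int>& flag) {
    assert(in.size() == flag.size());
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (flag[i]) {
            out[i] = heavy_path(in[i]);
        } else {
            out[i] = light_path(in[i]);
        }
    }
    return out;
}

// Split form: the branch is hoisted out of the kernel body. Indices are
// first partitioned by the branch condition, then each branch-free loop
// (one per would-be kernel launch) handles only its own index list.
std::vector<float> run_split(const std::vector<float>& in,
                             const std::vector<int>& flag) {
    assert(in.size() == flag.size());
    std::vector<std::size_t> heavy_idx, light_idx;
    for (std::size_t i = 0; i < in.size(); ++i) {
        (flag[i] ? heavy_idx : light_idx).push_back(i);
    }
    std::vector<float> out(in.size());
    for (std::size_t i : heavy_idx) out[i] = heavy_path(in[i]);  // "kernel" A
    for (std::size_t i : light_idx) out[i] = light_path(in[i]);  // "kernel" B
    return out;
}
```

Both forms compute the same output. The split form duplicates the loop scaffolding, which is the kind of redundancy the abstract describes, in exchange for branch-free bodies. Whether this raises occupancy on a given GPU depends on per-kernel resource usage, which this CPU emulation does not model.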


