ACM Home Page
Please provide us with feedback. Feedback
Dynamic warp formation: Efficient MIMD control flow on SIMD graphics hardware
Full text PdfPdf (2.41 MB)
Source
ACM Transactions on Architecture and Code Optimization (TACO) archive
Volume 6 ,  Issue 2  (June 2009) table of contents
Article No. 7  
Year of Publication: 2009
ISSN:1544-3566
Authors
Wilson W. L. Fung  University of British Columbia, Vancouver, B.C., Canada
Ivan Sham  University of British Columbia, Vancouver, B.C., Canada
George Yuan  University of British Columbia, Vancouver, B.C., Canada
Tor M. Aamodt  University of British Columbia, Vancouver, B.C., Canada
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 61,   Downloads (12 Months): 206,   Citation Count: 0
Additional Information:

abstract   references   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1543753.1543756
What is a DOI?

ABSTRACT

Recent advances in graphics processing units (GPUs) have resulted in massively parallel hardware that is easily programmable and widely available in today's desktop and notebook computer systems. GPUs typically use single-instruction, multiple-data (SIMD) pipelines to achieve high performance with minimal overhead for control hardware. Scalar threads running the same computing kernel are grouped together into SIMD batches, sometimes referred to as warps. While SIMD is ideally suited for simple programs, recent GPUs include control flow instructions in the GPU instruction set architecture and programs using these instructions may experience reduced performance due to the way branch execution is supported in hardware. One solution is to add a stack to allow different SIMD processing elements to execute distinct program paths after a branch instruction. The occurrence of diverging branch outcomes for different processing elements significantly degrades performance using this approach. In this article, we propose dynamic warp formation and scheduling, a mechanism for more efficient SIMD branch execution on GPUs. It dynamically regroups threads into new warps on the fly following the occurrence of diverging branch outcomes. We show that a realistic hardware implementation of this mechanism improves performance by 13%, on average, with 256 threads per core, 24% with 512 threads, and 47% with 768 threads for an estimated area increase of 8%.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
 
2
3
 
4
AMD, Inc. 2006. ATI CTM Guide, 1.01 ed. AMD, Inc.
 
5
6
 
7
Bouknight, W., Denenberg, S., McIntyre, D., Randall, J., Sameh, A., and Slotnick, D. 1972. The Illiac IV System. Proc. IEEE 60, 4, 369--388.
 
8
9
 
10
Burger, D. and Austin, T. M. 1997. The SimpleScalar Tool Set, Version 2.0. http://www.simplescalar.com.
 
11
Cervini, S. 2005. European Patent EP 1531391 A2: System and method for efficiently executing single program multiple data (SPMD) programs.
 
12
 
13
14
 
15
 
16
Dally, W. J. and Towles, B. 2004. Interconnection Networks. Morgan Kaufmann, San Francisco, CA.
 
17
del Barrio, V., Gonzalez, C., Roca, J., Fernandez, A., and Espasa, R. 2006. ATTILA: A cycle-level execution-driven simulator for modern GPU architectures. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software. IEEE, 231--241.
 
18
Fowler, H., Fowler, F., and Thompson, D., Eds. 1995. The Concise Oxford Dictionary 9th Ed. Oxford University Press.
 
19
20
 
21
Harris, M. 2007. Optimizing parallel reduction in cuda. http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf.
 
22
Hwu, W.-M., Kirk, D., Ryoo, S., Rodrigues, C., Stratton, J., and Huang, K. 2007. Performance insights on executing non-graphics applications on CUDA on the NVIDIA GeForce 8800 GTX. http://www.hotchips.org/archives/hc19/2Mon/HC19.02/HC19.02.03.pdf.
 
23
Intel Corp. 2008. Intel 965 Express Chipset Family and Intel G35 Express Chipset Graphics Controller Programmer's Reference Manual. Intel Corporation.
 
24
 
25
26
27
28
29
 
30
 
31
Lorie, R. A. and Strong, H. R. 1984. US Patent 4,435,758: Method for conditional branch execution in SIMD vector processors.
 
32
 
33
 
34
Moy, S. and Lindholm, E. 2005. US Patent 6,947,047: Method and system for programmable pipelined graphics processing with branching instructions.
 
35
 
36
Needleman, S. B. and Wunsch, C. D. 1970. A general method applicable to the search for similarities in the amino acid sequences of tow proteins. Mol. Biol. 48, 443--453.
 
37
NVIDIA Corp. CUDA SDK code samples. http://www.nvidia.com/object/cudagetsamples.html.
 
38
NVIDIA Corp. 2007a. NVIDIA CUDA Programming Guide, 1.1 ed. NVIDIA Corp.
 
39
NVIDIA Corp. 2007b. PTX: Parallel Thread Execution ISA, 1.1 ed. NVIDIA Corp.
40
 
41
42
 
43
44
 
45
Shebanow, M. 2007. ECE 498 AL: Programming massively parallel processors (lecture 12). http://courses.ece.uiuc.edu/ece498/al1/Archive/Spring2007.
 
46
 
47
Standard Performance Evaluation Corporation. SPEC CPU2006 benchmarks. http://www.spec.org/cpu2006/.
 
48
Tarjan, D., Thoziyor, S., and Jouppi, N. P. 2006. CACTI 4.0. Tech. rep. HPL-2006-86, Hewlett Packard Laboratories, Palo Alto, CA.
 
49
50
 
51
52
53

Collaborative Colleagues:
Wilson W. L. Fung: colleagues
Ivan Sham: colleagues
George Yuan: colleagues
Tor M. Aamodt: colleagues