|
ABSTRACT
Recent advances in graphics processing units (GPUs) have resulted in massively parallel hardware that is easily programmable and widely available in today's desktop and notebook computer systems. GPUs typically use single-instruction, multiple-data (SIMD) pipelines to achieve high performance with minimal overhead for control hardware. Scalar threads running the same computing kernel are grouped together into SIMD batches, sometimes referred to as warps. While SIMD is ideally suited for simple programs, recent GPUs include control flow instructions in the GPU instruction set architecture and programs using these instructions may experience reduced performance due to the way branch execution is supported in hardware. One solution is to add a stack to allow different SIMD processing elements to execute distinct program paths after a branch instruction. The occurrence of diverging branch outcomes for different processing elements significantly degrades performance using this approach. In this article, we propose dynamic warp formation and scheduling, a mechanism for more efficient SIMD branch execution on GPUs. It dynamically regroups threads into new warps on the fly following the occurrence of diverging branch outcomes. We show that a realistic hardware implementation of this mechanism improves performance by 13%, on average, with 256 threads per core, 24% with 512 threads, and 47% with 768 threads for an estimated area increase of 8%.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
Vikas Agarwal , M. S. Hrishikesh , Stephen W. Keckler , Doug Burger, Clock rate versus IPC: the end of the road for conventional microarchitectures, Proceedings of the 27th annual international symposium on Computer architecture, p.248-259, June 2000, Vancouver, British Columbia, Canada
|
| |
2
|
Alfred V. Aho , Ravi Sethi , Jeffrey D. Ullman, Compilers: principles, techniques, and tools, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, 1986
|
 |
3
|
J. R. Allen , Ken Kennedy , Carrie Porterfield , Joe Warren, Conversion of control dependence to data dependence, Proceedings of the 10th ACM SIGACT-SIGPLAN symposium on Principles of programming languages, p.177-189, January 24-26, 1983, Austin, Texas
[doi> 10.1145/567067.567085]
|
| |
4
|
AMD, Inc. 2006. ATI CTM Guide, 1.01 ed. AMD, Inc.
|
| |
5
|
|
 |
6
|
|
| |
7
|
Bouknight, W., Denenberg, S., McIntyre, D., Randall, J., Sameh, A., and Slotnick, D. 1972. The Illiac IV System. Proc. IEEE 60, 4, 369--388.
|
| |
8
|
|
 |
9
|
Ian Buck , Tim Foley , Daniel Horn , Jeremy Sugerman , Kayvon Fatahalian , Mike Houston , Pat Hanrahan, Brook for GPUs: stream computing on graphics hardware, ACM SIGGRAPH 2004 Papers, August 08-12, 2004, Los Angeles, California
|
| |
10
|
Burger, D. and Austin, T. M. 1997. The SimpleScalar Tool Set, Version 2.0. http://www.simplescalar.com.
|
| |
11
|
Cervini, S. 2005. European Patent EP 1531391 A2: System and method for efficiently executing single program multiple data (SPMD) programs.
|
| |
12
|
|
| |
13
|
|
 |
14
|
|
| |
15
|
William J. Dally , Francois Labonte , Abhishek Das , Patrick Hanrahan , Jung-Ho Ahn , Jayanth Gummaraju , Mattan Erez , Nuwan Jayasena , Ian Buck , Timothy J. Knight , Ujval J. Kapasi, Merrimac: Supercomputing with Streams, Proceedings of the 2003 ACM/IEEE conference on Supercomputing, p.35, November 15-21, 2003
|
| |
16
|
Dally, W. J. and Towles, B. 2004. Interconnection Networks. Morgan Kaufmann, San Francisco, CA.
|
| |
17
|
del Barrio, V., Gonzalez, C., Roca, J., Fernandez, A., and Espasa, R. 2006. ATTILA: A cycle-level execution-driven simulator for modern GPU architectures. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software. IEEE, 231--241.
|
| |
18
|
Fowler, H., Fowler, F., and Thompson, D., Eds. 1995. The Concise Oxford Dictionary 9th Ed. Oxford University Press.
|
| |
19
|
|
 |
20
|
|
| |
21
|
Harris, M. 2007. Optimizing parallel reduction in cuda. http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf.
|
| |
22
|
Hwu, W.-M., Kirk, D., Ryoo, S., Rodrigues, C., Stratton, J., and Huang, K. 2007. Performance insights on executing non-graphics applications on CUDA on the NVIDIA GeForce 8800 GTX. http://www.hotchips.org/archives/hc19/2Mon/HC19.02/HC19.02.03.pdf.
|
| |
23
|
Intel Corp. 2008. Intel 965 Express Chipset Family and Intel G35 Express Chipset Graphics Controller Programmer's Reference Manual. Intel Corporation.
|
| |
24
|
|
| |
25
|
|
 |
26
|
Ujval J. Kapasi , William J. Dally , Scott Rixner , Peter R. Mattson , John D. Owens , Brucek Khailany, Efficient conditional operations for data-parallel architectures, Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture, p.159-170, December 2000, Monterey, California, United States
[doi> 10.1145/360128.360145]
|
 |
27
|
Ronny Krashinsky , Christopher Batten , Mark Hampton , Steve Gerding , Brian Pharris , Jared Casper , Krste Asanovic, The Vector-Thread Architecture, Proceedings of the 31st annual international symposium on Computer architecture, p.52, June 19-23, 2004, München, Germany
|
 |
28
|
|
 |
29
|
|
| |
30
|
|
| |
31
|
Lorie, R. A. and Strong, H. R. 1984. US Patent 4,435,758: Method for conditional branch execution in SIMD vector processors.
|
| |
32
|
|
| |
33
|
|
| |
34
|
Moy, S. and Lindholm, E. 2005. US Patent 6,947,047: Method and system for programmable pipelined graphics processing with branching instructions.
|
| |
35
|
|
| |
36
|
Needleman, S. B. and Wunsch, C. D. 1970. A general method applicable to the search for similarities in the amino acid sequences of tow proteins. Mol. Biol. 48, 443--453.
|
| |
37
|
NVIDIA Corp. CUDA SDK code samples. http://www.nvidia.com/object/cudagetsamples.html.
|
| |
38
|
NVIDIA Corp. 2007a. NVIDIA CUDA Programming Guide, 1.1 ed. NVIDIA Corp.
|
| |
39
|
NVIDIA Corp. 2007b. PTX: Parallel Thread Execution ISA, 1.1 ed. NVIDIA Corp.
|
 |
40
|
|
| |
41
|
Scott Rixner , William J. Dally , Ujval J. Kapasi , Brucek Khailany , Abelardo López-Lagunas , Peter R. Mattson , John D. Owens, A bandwidth-efficient architecture for media processing, Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture, p.3-13, November 1998, Dallas, Texas, United States
|
 |
42
|
Scott Rixner , William J. Dally , Ujval J. Kapasi , Peter Mattson , John D. Owens, Memory access scheduling, Proceedings of the 27th annual international symposium on Computer architecture, p.128-138, June 2000, Vancouver, British Columbia, Canada
|
| |
43
|
|
 |
44
|
|
| |
45
|
Shebanow, M. 2007. ECE 498 AL: Programming massively parallel processors (lecture 12). http://courses.ece.uiuc.edu/ece498/al1/Archive/Spring2007.
|
| |
46
|
|
| |
47
|
Standard Performance Evaluation Corporation. SPEC CPU2006 benchmarks. http://www.spec.org/cpu2006/.
|
| |
48
|
Tarjan, D., Thoziyor, S., and Jouppi, N. P. 2006. CACTI 4.0. Tech. rep. HPL-2006-86, Hewlett Packard Laboratories, Palo Alto, CA.
|
| |
49
|
|
 |
50
|
|
| |
51
|
|
 |
52
|
Steven Cameron Woo , Moriyoshi Ohara , Evan Torrie , Jaswinder Pal Singh , Anoop Gupta, The SPLASH-2 programs: characterization and methodological considerations, Proceedings of the 22nd annual international symposium on Computer architecture, p.24-36, June 22-24, 1995, S. Margherita Ligure, Italy
|
 |
53
|
|
|