|
ABSTRACT
Gather and scatter are two fundamental data-parallel operations, where a large number of data items are read (gathered) from or are written (scattered) to given locations. In this paper, we study these two operations on graphics processing units (GPUs). With superior computing power and high memory bandwidth, GPUs have become a commodity multiprocessor platform for general-purpose high-performance computing. However, due to the random access nature of gather and scatter, a naive implementation of the two operations suffers from a low utilization of the memory bandwidth and consequently a long, unhidden memory latency. Additionally, the architectural details of the GPUs, in particular, the memory hierarchy design, are unclear to the programmers. Therefore, we design multi-pass gather and scatter operations to improve their data access locality, and develop a performance model to help understand and optimize these two operations. We have evaluated our algorithms in sorting, hashing, and the sparse matrix-vector multiplication in comparison with their optimized CPU counterparts. Our results show that these optimizations yield 2--4X improvement on the GPU bandwidth utilization and 30--50% improvement on the response time. Overall, our optimized GPU implementations are 2--7X faster than their optimized CPU counterparts.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
CWI Calibrator, http://monetdb.cwi.nl/Calibrator/.
|
| |
2
|
HPF, http://hpff.rice.edu/.
|
| |
3
|
Intel Math Kernel Library 9.0, http://www.intel.com/cd/software/products/asmona/eng/266853.htm.
|
| |
4
|
MPI, http://www.mpi-forum.org/docs/.
|
| |
5
|
NAS parallel benchmarks, http://www.nas.nasa.gov/Resources/Software/npb.html.
|
| |
6
|
NVIDIA CUDA (Compute Unified Device Architecture), http://developer.nvidia.com/object/cuda.html.
|
| |
7
|
Sparse matrix collection, http://www.cise.ufl.edu/research/sparse/matrices/.
|
| |
8
|
ZPL, http://www.cs.washington.edu/research/zpl/home/.
|
| |
9
|
M. Abramowitz and I. A. Stegun (Eds.). Stirling numbers of the second kind. In Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, 9th printing. New York: Dover, 1972.
|
 |
10
|
|
| |
11
|
|
| |
12
|
G. E. Blelloch. Prefix sums and their applications. Technical report, CMU-CS-90-190, Nov 1990.
|
| |
13
|
|
| |
14
|
I. Buck. Taking the plunge into GPU computing. In GPU Gems 2, Pharr M., (Ed.). Addison Wesley, Mar. 2005, pp. 509--519.
|
 |
15
|
|
 |
16
|
|
| |
17
|
|
| |
18
|
H. Gahvari. Benchmarking sparse matrix-vector multiply. Master thesis, Computer Science Division, U.C. Berkeley, December 2006.
|
| |
19
|
|
 |
20
|
Naga Govindaraju , Jim Gray , Ritesh Kumar , Dinesh Manocha, GPUTeraSort: high performance graphics co-processor sorting for large database management, Proceedings of the 2006 ACM SIGMOD international conference on Management of data, June 27-29, 2006, Chicago, IL, USA
[doi> 10.1145/1142473.1142511]
|
 |
21
|
|
 |
22
|
Naga K. Govindaraju , Brandon Lloyd , Wei Wang , Ming Lin , Dinesh Manocha, Fast computation of database operations using graphics processors, Proceedings of the 2004 ACM SIGMOD international conference on Management of data, June 13-18, 2004, Paris, France
[doi> 10.1145/1007568.1007594]
|
 |
23
|
|
| |
24
|
D. Horn. Lib GPU FFT, http://sourceforge.net/proiects/gpufft/, 2006.
|
 |
25
|
|
| |
26
|
|
 |
27
|
|
| |
28
|
|
| |
29
|
J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A. E. Lefohn and T. J. Purcell. A survey of general-purpose computation on graphics hardware. Computer Graphics Forum, Volume 26, 2007.
|
| |
30
|
Timothy J. Purcell , Craig Donner , Mike Cammarano , Henrik Wann Jensen , Pat Hanrahan, Photon mapping on programmable graphics hardware, Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, July 26-27, 2003, San Diego, California
|
| |
31
|
|
 |
32
|
|
| |
33
|
S. Williams, J. Shalf, L. Oliker, P. Husbands and K. Yelick, Dense and sparse matrix operations on the Cell processor. May, 2005. Lawrence Berkeley National Laboratory. Paper LBNL-58253.
|
 |
34
|
|
CITED BY 6
|
|
|
|
|
Bingsheng He , Ke Yang , Rui Fang , Mian Lu , Naga Govindaraju , Qiong Luo , Pedro Sander, Relational joins on graphics processors, Proceedings of the 2008 ACM SIGMOD international conference on Management of data, June 09-12, 2008, Vancouver, Canada
|
|
|
|
|
|
Bingsheng He , Wenbin Fang , Qiong Luo , Naga K. Govindaraju , Tuyong Wang, Mars: a MapReduce framework on graphics processors, Proceedings of the 17th international conference on Parallel architectures and compilation techniques, October 25-29, 2008, Toronto, Ontario, Canada
|
|
|
Perhaad Mistry , Sherman Braganza , David Kaeli , Miriam Leeser, Accelerating phase unwrapping and affine transformations for optical quadrature microscopy using CUDA, Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units, p.28-37, March 08-08, 2009, Washington, D.C.
|
|
|
Wenbin Fang , Mian Lu , Xiangye Xiao , Bingsheng He , Qiong Luo, Frequent itemset mining on graphics processors, Proceedings of the Fifth International Workshop on Data Management on New Hardware, June 28-28, 2009, Providence, Rhode Island
|
|