|
ABSTRACT
Modern GPUs offer much computing power at a very modest cost. Even though CUDA and other related recent developments are accelerating the use of GPUs for general purpose applications, several challenges still remain in programming the GPUs. Thus, it is clearly desirable to be able to program GPUs using a higher-level interface. In this paper, we offer a solution that targets a specific class of applications, which are the data mining and scientific data analysis applications. Our work is driven by the observation that a common processing structure, that of generalized reductions, fits a large number of popular data mining algorithms. In our solution, the programmers simply need to specify the sequential reduction loop(s) with some additional information about the parameters. We use program analysis and code generation to map the applications to a GPU. Several additional optimizations are also performed by the system. We have evaluated our system using three popular data mining applications, k-means clustering, EM clustering, and Principal Component Analysis (PCA). The main observations from our experiments are as follows. The speedup that each of these applications achieve over a sequential CPU version ranges between 20 and 50. The automatically generated version did not have any noticeable overheads compared to hand written codes. Finally, the optimizations performed in the system resulted in significant performance improvements.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
| |
2
|
|
| |
3
|
P. Anderson, D. Binkley, G. Rosay, and T. Teitelbaum. Flow Insensitive Points-To Sets. scam, 00:0081, 2001.
|
| |
4
|
Sara Baghsorkhi, Melvin Lathara, and Wen mei Hwu. CUDA-lite: Reducing GPU Programming Complexity. In LCPC 2008, 2008.
|
| |
5
|
Prithviraj Banerjee , John A. Chandy , Manish Gupta , Eugene W. Hodges IV , John G. Holm , Antonio Lain , Daniel J. Palermo , Shankar Ramaswamy , Ernesto Su, The Paradigm Compiler for Distributed-Memory Multicomputers, Computer, v.28 n.10, p.37-47, October 1995
[doi> 10.1109/2.467577]
|
 |
6
|
Muthu Manikandan Baskaran , Uday Bondhugula , Sriram Krishnamoorthy , J. Ramanujam , Atanas Rountev , P. Sadayappan, A compiler framework for optimization of affine loop nests for gpgpus, Proceedings of the 22nd annual international conference on Supercomputing, June 07-12, 2008, Island of Kos, Greece
[doi> 10.1145/1375527.1375562]
|
| |
7
|
William Blume , Ramon Doallo , Rudolf Eigenmann , John Grout , Jay Hoeflinger , Thomas Lawrence , Jaejin Lee , David Padua , Yunheung Paek , Bill Pottenger , Lawrence Rauchwerger , Peng Tu, Parallel Programming with Polaris, Computer, v.29 n.12, p.78-82, December 1996
[doi> 10.1109/2.546612]
|
| |
8
|
Randal E. Bryant. Data-Intensive Supercomputing: The Case for DISC. Technical Report CMU-CS-07-128, School of Computer Science, Carnegie Mellon University, 2007.
|
| |
9
|
I. Buck, T. Foley, D. Horn, J. Sugerman, K. Mike, and H. Pat. Brook for GPUs: Stream Computing on Graphics Hardware, 2004.
|
| |
10
|
Benjamin Bustos, Oliver Deussen, Stefan Hiller, and Daniel Keim. A Graphics Hardware Accelerated Algorithm for Nearest Neighbor Search. In Vassil N. Alexandrov, Geert Dick van Albada, Peter M. A. Sloot, and Jack Dongarra, editors, Computational Science-ICCS 2006, volume 3994 of LNCS, pages 196--199. Springer, 2006.
|
| |
11
|
Maria Charalambous, Pedro Trancoso, and Alexandros Stamatakis. Initial experiences porting a bioinformatics application to a graphics processor. In Panhellenic Conference on Informatics, pages 415--425, 2005.
|
| |
12
|
Shuai Che, Jiayuan Meng, and Jeremy W. Sheaffer. A Performance Study of General Purpose Applications on Graphics Processors.
|
| |
13
|
|
| |
14
|
Matthias Christen, Olaf Schenk, and Helmar Burkhart. General-Purpose Sparse Matrix Building Blocks using the NVIDIA CUDA Technology Platform. In First Workshop on General Purpose Processing on Graphics Processing Units, Oct 2007.
|
| |
15
|
|
| |
16
|
Arthur Dempster, Nan Laird, and Donald Rubin. Maximum Likelihood Estimation from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, 39(1):1--38, 1977.
|
| |
17
|
|
| |
18
|
Vincent Garcia, Eric Debreuve, and Michel Barlaud. Fast k Nearest Neighbor Search using GPU, 2008.
|
 |
19
|
Naga Govindaraju , Jim Gray , Ritesh Kumar , Dinesh Manocha, GPUTeraSort: high performance graphics co-processor sorting for large database management, Proceedings of the 2006 ACM SIGMOD international conference on Management of data, June 27-29, 2006, Chicago, IL, USA
[doi> 10.1145/1142473.1142511]
|
 |
20
|
|
| |
21
|
Jesse D. Hall and John C. Hart. GPU Acceleration of Iterative Clustering. Jun 2004.
|
| |
22
|
Mary W. Hall , Jennifer M. Anderson , Saman P. Amarasinghe , Brian R. Murphy , Shih-Wei Liao , Edouard Bugnion , Monica S. Lam, Maximizing Multiprocessor Performance with the SUIF Compiler, Computer, v.29 n.12, p.84-89, December 1996
[doi> 10.1109/2.546613]
|
| |
23
|
|
| |
24
|
|
 |
25
|
Bingsheng He , Wenbin Fang , Qiong Luo , Naga K. Govindaraju , Tuyong Wang, Mars: a MapReduce framework on graphics processors, Proceedings of the 17th international conference on Parallel architectures and compilation techniques, October 25-29, 2008, Toronto, Ontario, Canada
[doi> 10.1145/1454115.1454152]
|
 |
26
|
|
| |
27
|
|
| |
28
|
R. Jin and G. Agrawal. Shared memory parallelization of data mining algorithms: Techniques. citeseer.ist.psu.edu/article/jin02shared.html, 2002.
|
| |
29
|
Ruoming Jin and Gagan Agrawal. A Middleware for Developing Parallel Data Mining Implementations. In Proceedings of the first SIAM conference on Data Mining, April 2001.
|
| |
30
|
Andreas Klockner. PyCuda, 2008.
|
| |
31
|
|
| |
32
|
|
 |
33
|
|
| |
34
|
Shih-Wei Liao. Parallelizing user-defined and implicit reductions globally on multiprocessors. In Chris R. Jesshope and Colin Egan, editors, Asia-Pacific Computer Systems Architecture Conference, volume 4186 of Lecture Notes in Computer Science, pages 189--202. Springer, 2006.
|
| |
35
|
|
| |
36
|
|
| |
37
|
|
| |
38
|
NVidia. NVIDIA CUDA Compute Unified Device Architecture Programming Guide. version 2.0. http://developer.download.nvidia.com/compute/cuda/2.0-Beta2/docs/Programming_Guide_2.0beta2.pdf, June 7 2008.
|
 |
39
|
|
 |
40
|
|
 |
41
|
|
| |
42
|
Timothy J. Purcell , Craig Donner , Mike Cammarano , Henrik Wann Jensen , Pat Hanrahan, Photon mapping on programmable graphics hardware, Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, July 26-27, 2003, San Diego, California
|
| |
43
|
Erik Sintorn and Ulf Assarsson. Fast Parallel GPU-Sorting Using a Hybrid Algorithm. In First Workshop on General Purpose Processing on Graphics Processing Units, Oct 2007.
|
| |
44
|
John Stratton, Sam Stone, and Wen mei Hwu. MCUDA: An Efficient Implementation of CUDA Kernels for Multi-Core CPUs. In 21st Annual Workshop on Languages and Compilers for Parallel Computing (LCPC'2008), July 2008.
|
 |
45
|
|
| |
46
|
|
| |
47
|
Neil Trevett. OpenCL: The Open Standdard for Heterogeneous Parallel Programming, 2008.
|
 |
48
|
|
| |
49
|
Hans P. Zima and Barbara Mary Chapman. Compiling for Distributed-Memory Systems. Proceedings of the IEEE, 81(2):264--287, February 1993. In Special Section on Languages and Compilers for Parallel Machines.
|
|