Efficient computation of sum-products on GPUs through software-managed cache

Source: International Conference on Supercomputing
Proceedings of the 22nd annual international conference on Supercomputing
Island of Kos, Greece
SESSION: Memory management
Pages 309-318
Year of Publication: 2008
ISBN: 978-1-60558-158-3
Authors
Mark Silberstein, Technion - Israel Institute of Technology, Haifa, Israel
Assaf Schuster, Technion - Israel Institute of Technology, Haifa, Israel
Dan Geiger, Technion - Israel Institute of Technology, Haifa, Israel
Anjul Patney, University of California, Davis, CA, USA
John D. Owens, University of California, Davis, CA, USA
Bibliometrics
Downloads (6 Weeks): 30, Downloads (12 Months): 338, Citation Count: 2

ABSTRACT
We present a technique for designing memory-bound algorithms with high data reuse on Graphics Processing Units (GPUs) equipped with close-to-ALU software-managed memory. The approach is based on the efficient use of this memory through the implementation of a software-managed cache. We also present an analytical model for performance analysis of such algorithms. We apply this technique to implement a GPU-based solver for the sum-product, or marginalize a product of functions (MPF), problem, which arises in a wide variety of real-life applications in artificial intelligence, statistics, image processing, and digital communications. Our motivation to accelerate MPF originated in the analysis of genetic diseases, which in some cases requires years to complete on modern CPUs. Computing MPF is similar to computing the chain matrix product of multi-dimensional matrices, but is more difficult due to a complex data-dependent access pattern, high data reuse, and a low compute-to-memory access ratio. Our GPU-based MPF solver achieves up to a 2700-fold speedup on random data and a 270-fold speedup on real-life genetic analysis datasets on an NVIDIA GeForce 8800GTX GPU over an optimized CPU version running on a 2.4GHz Intel Core 2 with a 4MB L2 cache.
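As a rough illustration of what "marginalize a product of functions" means, the following minimal Python sketch computes one marginal of a product of two small tables in two ways: by brute force over all assignments, and by summing out one variable at a time, which is the chain-matrix-product-like view the abstract alludes to. The factors `f` and `g`, the binary domain, and the function names are hypothetical choices made for this example only; the sketch is not the paper's solver and does not model the GPU software-managed cache.

```python
from itertools import product

# Hypothetical toy problem: binary variables a, b, c and two factors,
# f(a, b) and g(b, c), stored as dictionaries keyed by value tuples.
DOMAIN = (0, 1)
f = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}
g = {(0, 0): 5.0, (0, 1): 6.0, (1, 0): 7.0, (1, 1): 8.0}


def mpf_brute_force():
    """For each a, sum f(a, b) * g(b, c) over every assignment of (b, c)."""
    return {
        a: sum(f[a, b] * g[b, c] for b, c in product(DOMAIN, repeat=2))
        for a in DOMAIN
    }


def mpf_eliminate():
    """Same marginal, computed by summing out c first and then b."""
    # h(b) = sum over c of g(b, c)
    h = {b: sum(g[b, c] for c in DOMAIN) for b in DOMAIN}
    # result(a) = sum over b of f(a, b) * h(b)
    return {a: sum(f[a, b] * h[b] for b in DOMAIN) for a in DOMAIN}


if __name__ == "__main__":
    print(mpf_brute_force())  # {0: 41.0, 1: 93.0}
    print(mpf_eliminate())    # {0: 41.0, 1: 93.0}
```

Both functions return the same table. The elimination order reuses the intermediate table `h` across all values of `a`, which hints at the kind of data reuse the abstract describes.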
CITED BY 2
Byunghyun Jang, Synho Do, Homer Pien, David Kaeli, Architecture-aware optimization targeting multithreaded stream computing, Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units, p.62-70, March 08-08, 2009, Washington, D.C.