ABSTRACT
Although graphics processors (GPUs) are becoming increasingly popular for general-purpose computing, current (and likely near-future) generations of GPUs do not provide hardware support for detecting soft or hard errors in computation logic or memory storage cells, since graphics applications are inherently fault tolerant. As a result, if an error occurs in a GPU during program execution, the results can be silently corrupted, which is not acceptable for general-purpose computations. To improve the fidelity of general-purpose computation on GPUs (GPGPU), we investigate software approaches to redundant execution. In particular, we propose and study three application-level techniques. The first simply executes the GPU kernel program twice and thus achieves roughly half the throughput of a non-redundant execution. The other two interleave redundant execution with the original code in different ways to exploit the parallelism between the original code and its redundant copy. Furthermore, we evaluate the benefits of providing hardware support, including ECC/parity protection for on-chip and off-chip memories, for each of the software techniques. Interestingly, our findings, based on six commonly used applications, indicate that the benefits of the complex software approaches are both application and architecture dependent. The simple approach, which executes the kernel twice, is often sufficient and may even outperform the complex ones. Moreover, we argue that the cost of protecting memories with ECC/parity bits is not justified.
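As a rough illustration of the first technique (executing the kernel twice and comparing the results), the following is a minimal host-side sketch in Python. It is not the authors' implementation: the names `run_twice_and_compare`, `scale_kernel`, and `make_faulty_kernel` are hypothetical, and an ordinary Python function stands in for a GPU kernel.

```python
from typing import Callable, List, Sequence


def run_twice_and_compare(
    kernel: Callable[[Sequence[float]], List[float]],
    data: Sequence[float],
) -> List[float]:
    """Run `kernel` twice on separate input copies and compare the outputs."""
    # Each run gets its own copy of the input, so corruption of one
    # input buffer cannot affect both executions.
    first = kernel(list(data))
    second = kernel(list(data))
    if first != second:
        # A mismatch means at least one execution was corrupted.
        raise RuntimeError("redundant executions disagree: possible silent error")
    return first


def scale_kernel(values: Sequence[float]) -> List[float]:
    # Hypothetical stand-in for a GPU kernel: doubles every element.
    return [2.0 * v for v in values]


def make_faulty_kernel() -> Callable[[Sequence[float]], List[float]]:
    # Hypothetical fault injector: corrupts one output element on the
    # first call only, mimicking a transient (soft) error.
    calls = {"n": 0}

    def kernel(values: Sequence[float]) -> List[float]:
        out = scale_kernel(values)
        calls["n"] += 1
        if calls["n"] == 1:
            out[0] += 1.0
        return out

    return kernel


# Fault-free case: both executions agree and the result is returned.
result = run_twice_and_compare(scale_kernel, [1.0, 2.0, 3.0])
```

Passing `make_faulty_kernel()` instead of `scale_kernel` makes the two runs disagree, so the sketch raises `RuntimeError` rather than returning a silently corrupted result. Because the two executions run back to back, the sketch does roughly twice the work of a single run, in line with the halved throughput noted above.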