ACM Home Page
Please provide us with feedback. Feedback
Understanding prediction-based partial redundant threading for low-overhead, high- coverage fault tolerance
Full text PdfPdf (367 KB)
Source Architectural Support for Programming Languages and Operating Systems archive
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems table of contents
San Jose, California, USA
SESSION: Hardware reliability and fault tolerance table of contents
Pages: 83 - 94  
Year of Publication: 2006
ISBN:1-59593-451-0
Also published in ...
Authors
Vimal K. Reddy  North Carolina State University
Eric Rotenberg  North Carolina State University
Sailashri Parthasarathy  Intel Corporation
Sponsors
ACM: Association for Computing Machinery
SIGARCH: ACM Special Interest Group on Computer Architecture
SIGPLAN: ACM Special Interest Group on Programming Languages
SIGOPS: ACM Special Interest Group on Operating Systems
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 13,   Downloads (12 Months): 88,   Citation Count: 5
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1168857.1168869
What is a DOI?

ABSTRACT

Redundant threading architectures duplicate all instructions to detect and possibly recover from transient faults. Several lighter weight Partial Redundant Threading (PRT) architectures have been proposed recently. (i) Opportunistic Fault Tolerance duplicates instructions only during periods of poor single-thread performance. (ii) ReStore does not explicitly duplicate instructions and instead exploits mispredictions among highly confident branch predictions as symptoms of faults. (iii) Slipstream creates a reduced alternate thread by replacing many instructions with highly confident predictions. We explore PRT as a possible direction for achieving the fault tolerance of full duplication with the performance of single-thread execution. Opportunistic and ReStore yield partial coverage since they are restricted to using only partial duplication or only confident predictions, respectively. Previous analysis of Slipstream fault tolerance was cursory and concluded that only duplicated instructions are covered. In this paper, we attempt to better understand Slipstream's fault tolerance, conjecturing that the mixture of partial duplication and confident predictions actually closely approximates the coverage of full duplication. A thorough dissection of prediction scenarios confirms that faults in nearly 100% of instructions are detectable. Fewer than 0.1% of faulty instructions are not detectable due to coincident faults and mispredictions. Next we show that the current recovery implementation fails to leverage excellent detection capability, since recovery sometimes initiates belatedly, after already retiring a detected faulty instruction. We propose and evaluate a suite of simple microarchitectural alterations to recovery and checking. Using the best alterations, Slipstream can recover from faults in 99% of instructions, compared to only 78% of instructions without alterations. Both results are much higher than predicted by past research, which claims coverage for only duplicated instructions, or 65% of instructions. On an 8-issue SMT processor, Slipstream performs within 1.3% of single-thread execution whereas full duplication slows performance by 14%.A key byproduct of this paper is a novel analysis framework in which every dynamic instruction is considered to be hypothetically faulty, thus not requiring explicit fault injection. Fault coverage is measured in terms of the fraction of candidate faulty instructions that are directly or indirectly detectable before.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
 
2
D. Burger, T.M. Austin, and S. Bennett. The Simplescalar Toolset, Version 2. Tech. Report CS-TR-1997-1342, CS Department, University of Wisconsin-Madison, July 1997.
3
4
 
5
 
6
S. Kumar and A. Aggarwal. Reducing resource redundancy for concurrent error detection techniques in high performance microprocessors. 12th International Symposium on High-Performance Computer Architecture, pp. 212--221, Feb. 2006.
7
8
 
9
10
 
11
S. Parthasarathy. Improving transient fault tolerance of slipstream processors. M.S. Thesis, ECE Department, North Carolina State University, Dec. 2005.
 
12
13
 
14
Z. Purser, K. Sundaramoorthy, and E. Rotenberg. Slipstream memory hierarchies. Tech. Report CESR-TR-02-3, ECE Department, North Carolina State University, Feb. 2002.
 
15
16
 
17
18
 
19
 
20
E. Rotenberg. Exploiting large ineffectual instruction sequences. Technical Report, North Carolina State University, Nov. 1999.
21
 
22
23
24
25
 
26


Collaborative Colleagues:
Vimal K. Reddy: colleagues
Eric Rotenberg: colleagues
Sailashri Parthasarathy: colleagues