A light-weight cache-based fault detection and checkpointing scheme for MPSoCs enabling relaxed execution synchronization
Source
International Conference on Compilers, Architecture and Synthesis for Embedded Systems
Proceedings of the 2008 international conference on Compilers, architectures and synthesis for embedded systems
Atlanta, GA, USA
SESSION: Resiliency
Pages 11-20  
Year of Publication: 2008
ISBN: 978-1-60558-469-0
Authors
Chengmo Yang, UC San Diego, San Diego, CA, USA
Alex Orailoglu, UC San Diego, San Diego, CA, USA
Sponsors
SIGDA: ACM Special Interest Group on Design Automation
ACM: Association for Computing Machinery
SIGBED: ACM Special Interest Group on Embedded Systems
SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1450095.1450100

ABSTRACT

While technology advances have made MPSoCs a standard architecture for embedded systems, their applicability is increasingly challenged by a dramatic rise in the number of device failures that may occur during execution. Conventional fault tolerance techniques employ a duplication-and-comparison strategy to detect arbitrary execution faults and a checkpointing-and-rollback strategy to recover from a faulty state. Comparison and checkpointing are performed either at the task level, which imposes a large overhead in verifying and backing up memory pages, or at the instruction level, which necessitates a lock-step execution model that significantly limits attainable performance. To overcome the shortcomings of both approaches, this paper proposes a cache-based fault tolerance scheme in which comparison and checkpointing are performed at the cache-memory interface. By letting two processors that execute duplicated tasks share a single data cache, the scheme verifies execution results before they are written back to memory, thus protecting memory from pollution by execution faults. This in turn significantly reduces the checkpointing overhead. Moreover, since only the data written to memory are compared, the strict instruction-by-instruction synchronization model used in multithreading processors can be relaxed. Simulation results confirm that the proposed scheme imposes a performance overhead of only 1.4% to 10.4% while attaining both fault detection and execution checkpointing.
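
The idea of comparing results at the cache-memory interface can be illustrated with a small software model. The sketch below is a toy illustration, not the authors' hardware design: the names SharedCheckedCache, write, write_back and rollback, the replica ids 0 and 1, and the dictionary standing in for main memory are all assumptions introduced here.

```python
class FaultDetected(Exception):
    """Raised when the two replicas disagree on a value bound for memory."""


class SharedCheckedCache:
    """Toy model of one data cache shared by two replicas of the same task.

    Hypothetical structure for illustration only; replica ids are 0 and 1.
    """

    def __init__(self, memory):
        self.memory = memory   # verified state; doubles as the checkpoint
        self.pending = {}      # addr -> {replica_id: value}, not yet verified

    def write(self, replica_id, addr, value):
        # Buffer the write in the cache; memory stays untouched for now.
        self.pending.setdefault(addr, {})[replica_id] = value

    def write_back(self, addr):
        # Compare the two replicas' values before anything reaches memory.
        values = self.pending.get(addr, {})
        if 0 not in values or 1 not in values:
            return False       # the slower replica has not written yet
        if values[0] != values[1]:
            raise FaultDetected(f"mismatch at address {addr:#x}")
        self.memory[addr] = values[0]
        del self.pending[addr]
        return True

    def rollback(self):
        # Discard unverified data; memory still holds the last verified state.
        self.pending.clear()


memory = {0x10: 0}
cache = SharedCheckedCache(memory)

cache.write(0, 0x10, 42)
assert cache.write_back(0x10) is False   # replica 1 lags; no lock-step needed
cache.write(1, 0x10, 42)
assert cache.write_back(0x10) is True    # both agree, so memory is updated
assert memory[0x10] == 42

cache.write(0, 0x10, 7)
cache.write(1, 0x10, 8)                  # simulated execution fault
try:
    cache.write_back(0x10)
except FaultDetected:
    cache.rollback()                     # recover from the verified memory
assert memory[0x10] == 42
```

In this model the two replicas may write at different times, which mirrors the relaxed synchronization the abstract describes, and memory never receives an unverified value, so it can serve as the rollback state without a separate backup copy.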


