Architectural core salvaging in a multi-core processor for hard-error tolerance
Source
International Symposium on Computer Architecture
Proceedings of the 36th Annual International Symposium on Computer Architecture
Austin, TX, USA
Session: Reliability and fault tolerance
Pages 93-104  
Year of Publication: 2009
ISBN: 978-1-60558-526-0
Authors
Michael D. Powell  Intel Massachusetts, Hudson, MA, USA
Arijit Biswas  Intel Massachusetts, Hudson, MA, USA
Shantanu Gupta  University of Michigan, Ann Arbor, MI, USA
Shubhendu S. Mukherjee  Intel Massachusetts, Hudson, MA, USA
Sponsors
SIGARCH: ACM Special Interest Group on Computer Architecture
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1555754.1555769

ABSTRACT

The incidence of hard errors in CPUs is a challenge for future multicore designs due to increasing total core area. Even if the location and nature of hard errors are known a priori, either at manufacture time or in the field, cores with such errors must be disabled in the absence of hard-error tolerance. While caches, with their regular and repetitive structures, are easily covered against hard errors by providing spare arrays or spare lines, structures within a core are neither as regular nor as repetitive. Previous work has proposed microarchitectural core salvaging to exploit structural redundancy within a core and maintain functionality in the presence of hard errors. Unfortunately, microarchitectural salvaging introduces complexity and may provide only limited coverage of core area against hard errors due to a lack of natural redundancy in the core.

This paper makes a case for architectural core salvaging. We observe that even if some individual cores cannot execute certain operations, a CPU die can be instruction-set-architecture (ISA) compliant, that is, execute all of the instructions required by its ISA, by exploiting natural cross-core redundancy. We propose using hardware to migrate offending threads to another core that can execute the operation. Architectural core salvaging can cover a large core area against faults and can be implemented by leveraging known techniques that minimize changes to the microarchitecture. We show it is possible to optimize architectural core salvaging such that the performance on a faulty die approaches that of a fault-free die, assuring significantly better performance than core disabling for many workloads and no worse performance than core disabling for the remainder.
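
The idea of die-level ISA compliance through thread migration can be illustrated with a toy software model. This is only a sketch under stated assumptions, not the paper's hardware mechanism: the operation classes in `ISA`, the `Core` type, and the "first capable core" policy are all hypothetical names and choices made for illustration.

```python
from dataclasses import dataclass

# Hypothetical operation classes standing in for a real ISA.
ISA = frozenset({"int_alu", "int_mul", "fp_add", "load_store"})


@dataclass(frozen=True)
class Core:
    core_id: int
    working_ops: frozenset  # operation classes this core can still execute


def die_is_isa_compliant(cores):
    """The die covers the ISA if every operation class works on some core."""
    covered = frozenset().union(*(c.working_ops for c in cores))
    return ISA <= covered


def run_thread(ops, cores, start=0):
    """Run ops in order, migrating when the current core cannot execute one.

    Returns (trace, migrations), where trace lists (op, core_id) pairs.
    The thread stays on the new core after a migration (a simplification).
    """
    current = cores[start]
    trace, migrations = [], 0
    for op in ops:
        if op not in current.working_ops:
            # Toy policy (an assumption): pick the first core that can run op.
            # Raises StopIteration if no core can, i.e. the die is not compliant.
            current = next(c for c in cores if op in c.working_ops)
            migrations += 1
        trace.append((op, current.core_id))
    return trace, migrations


cores = [
    Core(0, ISA - {"int_mul"}),  # core 0 has a faulty multiplier
    Core(1, ISA),                # core 1 is fault-free
]
trace, migrations = run_thread(["int_alu", "int_mul", "fp_add"], cores)
```

In this example the thread starts on core 0, migrates once when it reaches the multiply, and finishes on core 1. A die consisting only of core 0 would fail the `die_is_isa_compliant` check, which corresponds to the case where the core would have to be disabled.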


