ABSTRACT
High availability is an increasingly important requirement for enterprise systems, often valued more than performance. Systems designed for high availability typically use redundant hardware for error detection and for continued uptime in the event of a failure. Chip multiprocessors, with an abundance of identical resources such as cores, caches, and interconnection networks, would appear to be ideal building blocks for implementing high-availability solutions on chip. However, doing so poses significant challenges with respect to error containment and faulty component replacement. Rising silicon and transient fault rates under future technology scaling exacerbate the problem. This paper proposes a novel, cost-effective architecture for high-availability systems built from future multi-core processors: a chip multiprocessor that provides configurable isolation for fault containment and component retirement, based on inexpensive modifications to commodity designs. The design is evaluated against a state-of-the-art industrial fault model, and the proposed architecture is shown to provide effective fault isolation and graceful degradation even when the failure rate is high.