Hardware fault containment in scalable shared-memory multiprocessors
Source: International Symposium on Computer Architecture
Proceedings of the 24th Annual International Symposium on Computer Architecture
Denver, Colorado, United States
Pages: 73-84
Year of Publication: 1997
ISBN: 0-89791-901-7
Authors
Dan Teodosiu  Computer Systems Laboratory, Stanford University, Stanford, CA
Joel Baxter  Computer Systems Laboratory, Stanford University, Stanford, CA
Kinshuk Govil  Computer Systems Laboratory, Stanford University, Stanford, CA
John Chapin  Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, MA and Computer Systems Laboratory, Stanford University, Stanford, CA
Mendel Rosenblum  Computer Systems Laboratory, Stanford University, Stanford, CA
Mark Horowitz  Computer Systems Laboratory, Stanford University, Stanford, CA
Sponsor
SIGARCH: ACM Special Interest Group on Computer Architecture
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/264107.264141
ABSTRACT

Current shared-memory multiprocessors are inherently vulnerable to faults: any significant hardware or system software fault causes the entire system to fail. Unless provisions are made to limit the impact of faults, users will perceive a decrease in reliability when they entrust their applications to larger machines. This paper shows that fault containment techniques can be effectively applied to scalable shared-memory multiprocessors to reduce the reliability problems created by increased machine size.

The primary goal of our approach is to leave normal-mode performance unaffected. Rather than using expensive fault-tolerance techniques to mask the effects of data and resource loss, our strategy is based on limiting the damage caused by faults to only a portion of the machine. After a hardware fault, we run a distributed recovery algorithm that allows normal operation to be resumed in the functioning parts of the machine.

Our approach is implemented in the Stanford FLASH multiprocessor. Using a detailed hardware simulator, we have performed a number of fault injection experiments on a FLASH system running Hive, an operating system designed to support fault containment. The results we report validate our approach and show that in conjunction with an operating system like Hive, we can improve the reliability seen by unmodified applications without substantial performance cost. Simulation results suggest that our algorithms scale well for systems of up to 128 processors.

