| Hardware fault containment in scalable shared-memory multiprocessors |
| Source
|
International Symposium on Computer Architecture
Proceedings of the 24th annual international symposium on Computer architecture
Denver, Colorado, United States
Pages: 73 - 84
Year of Publication: 1997
ISBN: 0-89791-901-7
|
|
Authors
|
|
Dan Teodosiu
|
Computer Systems Laboratory, Stanford University, Stanford, CA
|
|
Joel Baxter
|
Computer Systems Laboratory, Stanford University, Stanford, CA
|
|
Kinshuk Govil
|
Computer Systems Laboratory, Stanford University, Stanford, CA
|
|
John Chapin
|
Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, MA and Computer Systems Laboratory, Stanford University, Stanford, CA
|
|
Mendel Rosenblum
|
Computer Systems Laboratory, Stanford University, Stanford, CA
|
|
Mark Horowitz
|
Computer Systems Laboratory, Stanford University, Stanford, CA
|
|
ABSTRACT
Current shared-memory multiprocessors are inherently vulnerable to faults: any significant hardware or system software fault causes the entire system to fail. Unless provisions are made to limit the impact of faults, users will perceive a decrease in reliability when they entrust their applications to larger machines. This paper shows that fault containment techniques can be effectively applied to scalable shared-memory multiprocessors to reduce the reliability problems created by increased machine size.

The primary goal of our approach is to leave normal-mode performance unaffected. Rather than using expensive fault-tolerance techniques to mask the effects of data and resource loss, our strategy is based on limiting the damage caused by faults to only a portion of the machine. After a hardware fault, we run a distributed recovery algorithm that allows normal operation to be resumed in the functioning parts of the machine.

Our approach is implemented in the Stanford FLASH multiprocessor. Using a detailed hardware simulator, we have performed a number of fault injection experiments on a FLASH system running Hive, an operating system designed to support fault containment. The results we report validate our approach and show that in conjunction with an operating system like Hive, we can improve the reliability seen by unmodified applications without substantial performance cost. Simulation results suggest that our algorithms scale well for systems up to 128 processors.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
D. Aingworth , C. Chekuri , R. Motwani, Fast estimation of diameter and shortest paths (without matrix multiplication), Proceedings of the seventh annual ACM-SIAM symposium on Discrete algorithms, p.547-553, January 28-30, 1996, Atlanta, Georgia, United States
|
| |
2
|
|
 |
3
|
J. Chapin , M. Rosenblum , S. Devine , T. Lahiri , D. Teodosiu , A. Gupta, Hive: fault containment for shared-memory multiprocessors, Proceedings of the fifteenth ACM symposium on Operating systems principles, p.12-25, December 03-06, 1995, Copper Mountain, Colorado, United States
|
| |
4
|
M. Galles. "Scalable Pipelined Interconnect for Distributed Endpoint Routing: The SGI SPIDER Chip." Presented at Hot Interconnects Symposium IV, August 1996.
|
| |
5
|
|
 |
6
|
James R. Goodman , Mary K. Vernon , Philip J. Woest, Efficient synchronization primitives for large-scale cache-coherent multiprocessors, Proceedings of the third international conference on Architectural support for programming languages and operating systems, p.64-75, April 03-06, 1989, Boston, Massachusetts, United States
|
 |
7
|
Mark Heinrich , Jeffrey Kuskin , David Ofelt , John Heinlein , Joel Baxter , Jaswinder Pal Singh , Richard Simoni , Kourosh Gharachorloo , David Nakahira , Mark Horowitz , Anoop Gupta , Mendel Rosenblum , John Hennessy, The performance impact of flexibility in the Stanford FLASH multiprocessor, Proceedings of the sixth international conference on Architectural support for programming languages and operating systems, p.274-285, October 05-07, 1994, San Jose, California, United States
|
| |
8
|
Y. A. Khalidi, J. M. Bernabeu, V. Matena, K. Shirriff, and M. Thadani. "Solaris MC: A Multi Computer OS." In Proceedings of the USENIX 1996 Annual Technical Conference, pp. 191-204, January 1996.
|
 |
9
|
J. Kuskin , D. Ofelt , M. Heinrich , J. Heinlein , R. Simoni , K. Gharachorloo , J. Chapin , D. Nakahira , J. Baxter , M. Horowitz , A. Gupta , M. Rosenblum , J. Hennessy, The Stanford FLASH multiprocessor, Proceedings of the 21st annual international symposium on Computer architecture, p.302-313, April 18-21, 1994, Chicago, Illinois, United States
|
 |
10
|
|
 |
11
|
|
| |
12
|
|
 |
13
|
Christine Morin , Alain Gefflaut , Michel Banâtre , Anne-Marie Kermarrec, COMA: an opportunity for building fault-tolerant scalable shared memory multiprocessors, Proceedings of the 23rd annual international symposium on Computer architecture, p.56-65, May 22-24, 1996, Philadelphia, Pennsylvania, United States
|
| |
14
|
C. Morin and I. Puaut. "A Survey of Recoverable Distributed Shared Memory Systems." IRISA Technical Report 975, December 1995.
|
| |
15
|
|
| |
16
|
|
| |
17
|
|
 |
18
|
Mendel Rosenblum , John Chapin , Dan Teodosiu , Scott Devine , Tirthankar Lahiri , Anoop Gupta, Implementing efficient fault containment for multiprocessors: confining faults in a shared-memory multiprocessor environment, Communications of the ACM, v.39 n.9, p.52-61, Sept. 1996
[doi> 10.1145/234215.234471]
|
| |
19
|
|
| |
20
|
|
| |
21
|
J. Vounckx et al. "Fault-Tolerant Compact Routing based on Reduced Structural Information in Wormhole-Switching based Networks." In Proceedings of the Colloquium on Structural Information and Communication Complexity, May 1994.
|
| |
22
|
|
 |
23
|
Wolf-Dietrich Weber , Stephen Gold , Pat Helland , Takeshi Shimizu , Thomas Wicki , Winfried Wilcke, The Mercury Interconnect Architecture: a cost-effective infrastructure for high-performance servers, Proceedings of the 24th annual international symposium on Computer architecture, p.98-107, June 01-04, 1997, Denver, Colorado, United States
|
| |
24
|
W. Wulf, R. Levin, and S. P. Harbison. HYDRA/C.mmp: An Experimental Computer System. McGraw-Hill, 1981.
|
| |
25
|
|
CITED BY 10
Dejan Milojicic , Alan Messer , James Shau , Guangrui Fu , Alberto Munoz, Increasing relevance of memory hardware errors: a case for recoverable programming models, Proceedings of the 9th workshop on ACM SIGOPS European workshop: beyond the PC: new challenges for the operating system, September 17-20, 2000, Kolding, Denmark