Fault tolerance by means of external monitoring of computer systems
Source: AFIPS Joint Computer Conferences
Proceedings of the May 4-7, 1981, National Computer Conference
Chicago, Illinois
Session: Computer hardware and architecture
Pages 27-40
Year of Publication: 1981
Author
Algirdas Avižienis, University of California at Los Angeles, Los Angeles, California
Sponsor
AFIPS : American Federation of Information Processing Societies
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1500412.1500417

ABSTRACT

A frequently suggested solution to the problem of increasing the reliability of an already existing computer system (to be called the object machine [OM]) is to employ a functionally and physically separate monitor computer (to be called the monitor machine [MM]) that probes the operation of the OM in real time. The purpose of the monitoring is to assure that the functional performance of the OM does not deviate from the behavior specified by its design and by the programs being executed.

This paper systematically assesses the architectural and fault-tolerance issues that have to be resolved to effectively implement the monitoring process. The goal of the implementation is to create an integrated and uniformly fault-tolerant OM/MM complex, beginning with a given OM design.

Four principal problems are addressed in the subsequent sections: (1) implementation of the monitor machine; (2) implementation of the monitoring (OM/MM) interface; (3) specification of the monitoring function; and (4) the cost and effectiveness of monitoring.

The paper concludes with examples of model technical specifications for the architectural properties needed by the OM and the MM to attain a fault-tolerant implementation of the monitoring process.


REFERENCES


 
1
Amdahl Corporation. 470V/6 Machine Reference Manual. MRM 1000-1, 1976.
 
2
Anderson, J. E., and F. J. Macri. "Multiple Redundancy Applications in a Computer." Proc. 1967 Ann. Symposium on Reliability. Washington, D.C., January 1967, pp. 553--562.
 
3
 
4
 
5
Avizienis, A., and D. A. Rennels. "Fault-Tolerance Experiments with the JPL STAR Computer." Digest of COMPCON '72 (Sixth Annual IEEE Computer Society Int. Conf.), San Francisco, California, 1972, pp. 321--324.
 
6
Avizienis, A., and L. Chen. "On the Implementation of N-version Programming for Software Fault-Tolerance During Program Execution." Proceedings 1977 Int. Computer Software and Applications Conference, Chicago, Illinois, November 1977, pp. 149--155.
 
7
Avizienis, A. "Fault-Tolerant Computing---Progress, Problems, and Prospects." Proc. IFIP Congress 1977, Toronto, Canada, pp. 405--420.
 
8
Avizienis, A. "Fault-Tolerance: The Survival Attribute of Digital Systems." Proc. IEEE, 66 (1978), pp. 1109--1125.
 
9
The Bell System Technical Journal, 56 (1977) (special issue on the 1A Processor), pp. 119--315.
 
10
Beuscher, H. J., et al. "Administration and Maintenance Plan of No. 2 ESS." The Bell System Technical Journal, 48 (1969), pp. 2765--2815.
 
11
Burchby, D. D., L. W. Kern, and W. A. Sturm. "Specification of the Fault-Tolerant Spaceborne Computer (FTSC)." Proc. 1976 Int. Symposium on Fault-Tolerant Computing, Pittsburgh, Pennsylvania, June 1976, pp. 129--133.
 
12
Burroughs Corp. Introduction to Burroughs Scientific Processor, 1977.
 
13
Chang, H. Y., G. W. Smith, Jr., and R. B. Walford. "LAMP: System Description." The Bell System Technical Journal, 53 (1974), pp. 1431--1449.
 
14
Control Data Corp. Control Data STAR Computer System: Hardware Reference Manual, 60256000-01, 1970.
 
15
Cordero, H., Jr. "4341's Infrastructure Is New from the Substrate Up." Electronics (November 8, 1979), pp. 110--115.
 
16
CRAY Research, Inc. CRAY-1 Computer System: Reference Manual, 2240004, Rev. B-02, July 1977.
 
17
Digital Equipment Corp., DECSYSTEM 20 Technical Summary, 1976.
 
18
Downing, R. W., J. S. Nowak, and L. S. Tuomenoksa. "No. 1-ESS Maintenance Plan." The Bell System Technical Journal, 43 (1964), pp. 1961--2019.
 
19
Hopkins, A. L., Jr., T. B. Smith, III, and J. H. Lala. "FTMP---A Highly Reliable Fault-Tolerant Multiprocessor for Aircraft." Proc. IEEE, 66 (1978), pp. 1221--1239.
 
20
Siewiorek, D., M. Canepa, and S. Clark. "C.vmp: The Architecture of a Fault-Tolerant Multiprocessor." Proc. 1977 Int. Symposium on Fault-Tolerant Computing, Los Angeles, California, June 1977, pp. 37--43.
 
21
Sklaroff, J. R. "Redundancy Management Technique for Space Shuttle Computers." IBM Journal of Research and Development, 20 (1976), pp. 20--28.
 
22
 
23
Wensley, J. H., et al. "SIFT: The Design and Analysis of a Fault-Tolerant Computer for Aircraft Control." Proc. IEEE, 66 (1978), pp. 1240--1255.