ABSTRACT
A frequently suggested solution to the problem of increasing the reliability of an already existing computer system (to be called the object machine [OM]) is to employ a functionally and physically separate monitor computer (to be called the monitor machine [MM]) that probes the operation of the OM in real time. The purpose of the monitoring is to assure that the functional performance of the OM does not deviate from the behavior specified by its design and by the programs being executed. This paper systematically assesses the architectural and fault-tolerance issues that have to be resolved to effectively implement the monitoring process. The goal of the implementation is to create an integrated and uniformly fault-tolerant OM/MM complex, beginning with a given OM design. Four principal problems are addressed in the subsequent sections: (1) implementation of the monitor machine; (2) implementation of the monitoring (OM/MM) interface; (3) specification of the monitoring function; and (4) the cost and effectiveness of monitoring. The paper concludes with examples of model technical specifications for the architectural properties needed by the OM and the MM to attain a fault-tolerant implementation of the monitoring process.