Optimal design and use of retry in fault-tolerant computer systems
Source: Journal of the ACM (JACM)
Volume 35, Issue 1 (January 1988)
Pages: 45-69
Year of Publication: 1988
ISSN: 0004-5411
Authors
Yann-Heng Lee, IBM Thomas J. Watson Research Center, Yorktown Heights, NY
Kang G. Shin, Univ. of Michigan, Ann Arbor
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/42267.42269

ABSTRACT

In this paper, a new method is presented for (i) determining an optimal retry policy and (ii) using retry for fault characterization, which is defined as classification of the fault type and determination of fault durations. First, an optimal retry policy is derived for a given fault characteristic; the policy determines the maximum allowable retry durations so as to minimize the total task completion time. Then, combined fault characterization and retry decision is carried out, in which the characteristic of a fault is estimated simultaneously with the determination of the optimal retry policy. Two solution approaches are developed: one is based on point estimation and the other on Bayes sequential decision analysis. Numerical examples are presented in which all the durations associated with faults (i.e., active, benign, and interfailure durations) have monotone hazard rate functions (e.g., exponential, Weibull, and gamma distributions). These are standard distributions commonly used for the modeling and analysis of faults.
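
To make the idea of a maximum allowable retry duration concrete, the following is a minimal sketch under a deliberately simplified model. It is not the model or derivation used in the paper. It assumes that a detected fault is transient with prior probability p and otherwise permanent, that a transient fault stays active for an exponentially distributed time with rate mu, and that abandoning retry costs a fixed restart time. All function and parameter names are hypothetical.

```python
import math


def expected_overhead(tau, p, mu, restart_cost):
    """Expected time lost when retrying for at most tau, then restarting.

    Assumed toy model: the fault is transient with probability p and then
    clears after an Exp(mu) time; otherwise it never clears.
    """
    cleared = p * (1.0 - math.exp(-mu * tau))  # P(fault clears within tau)
    # p * E[T; T <= tau] for T ~ Exp(mu): retry time spent when retry succeeds
    time_when_cleared = p * (
        (1.0 - math.exp(-mu * tau)) / mu - tau * math.exp(-mu * tau)
    )
    # If the fault has not cleared by tau, we lose tau plus the restart time.
    return time_when_cleared + (1.0 - cleared) * (tau + restart_cost)


def optimal_retry_duration(p, mu, restart_cost):
    """Retry duration minimizing expected_overhead in the toy model.

    Setting the derivative of expected_overhead to zero gives
    exp(-mu * tau) * p * (mu * restart_cost - 1) = 1 - p.
    """
    if not (0.0 <= p < 1.0) or mu <= 0.0:
        raise ValueError("need 0 <= p < 1 and mu > 0")
    arg = p * (mu * restart_cost - 1.0) / (1.0 - p)
    # If arg <= 1 the overhead is non-decreasing in tau: do not retry at all.
    return math.log(arg) / mu if arg > 1.0 else 0.0


def posterior_permanent(t, p, mu):
    """Bayes update: P(fault is permanent | retry has failed for time t)."""
    still_active_transient = p * math.exp(-mu * t)
    return (1.0 - p) / ((1.0 - p) + still_active_transient)


if __name__ == "__main__":
    p, mu, restart_cost = 0.9, 1.0, 5.0
    tau_star = optimal_retry_duration(p, mu, restart_cost)
    print("optimal retry duration:", tau_star)
    print("expected overhead:", expected_overhead(tau_star, p, mu, restart_cost))
    print("P(permanent) at cutoff:", posterior_permanent(tau_star, p, mu))
```

In this sketch the longer a retry goes on without success, the more likely the fault is permanent, so retry stops once that posterior probability makes a restart cheaper in expectation. This only loosely mirrors the abstract's pairing of fault characterization with the retry decision. The paper itself treats more general duration distributions with monotone hazard rates.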


