Measurement and modeling of computer reliability as affected by system activity
Source ACM Transactions on Computer Systems (TOCS)
Volume 4, Issue 3 (August 1986)
Pages: 214-237
Year of Publication: 1986
ISSN: 0734-2071
Authors
R. K. Iyer  Univ. of Illinois at Urbana-Champaign, Urbana
D. J. Rossetti  Stanford Univ., Stanford, CA
M. C. Hsueh  Univ. of Illinois at Urbana-Champaign, Urbana
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/6420.6422

ABSTRACT

This paper demonstrates a practical approach to the study of the failure behavior of computer systems. Particular attention is devoted to the analysis of permanent failures. A number of important techniques, which may have general applicability in both failure and workload analysis, are brought together in this presentation. These include: smeared averaging of the workload data, clustering of like failures, and joint analysis of workload and failures. Approximately 17 percent of all failures affecting the CPU were estimated to be permanent. The manifestation of a permanent failure was found to be strongly correlated with the level and type of workload prior to the failure. Although, in strict terms, the results only relate to the manifestation of permanent failures and not to their occurrence, there are strong indications that permanent failures are both caused and discovered by increased activity. More measurements and experiments are necessary to determine their respective contributions to the measured workload/failure relationship.
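
The abstract names its techniques without defining them. As a rough illustration only, the sketch below shows one plausible reading of two of them: "clustering of like failures" as grouping same-type error records that fall close together in time, and "smeared averaging" as a trailing moving average over workload samples. The record fields, the window parameter, and both function names are assumptions made for this example and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class ErrorRecord:
    time: float  # seconds since the start of the log (hypothetical unit)
    kind: str    # hypothetical error-type label, e.g. "cpu"


def cluster_like_failures(records: Sequence[ErrorRecord],
                          window: float) -> List[List[ErrorRecord]]:
    """Group records of the same kind whose successive gaps are <= window."""
    clusters: List[List[ErrorRecord]] = []
    open_by_kind: Dict[str, List[ErrorRecord]] = {}  # kind -> latest cluster
    for rec in sorted(records, key=lambda r: r.time):
        current = open_by_kind.get(rec.kind)
        if current is not None and rec.time - current[-1].time <= window:
            # Close enough to the previous like error: same cluster.
            current.append(rec)
        else:
            # Too far apart (or first of its kind): start a new cluster.
            current = [rec]
            clusters.append(current)
            open_by_kind[rec.kind] = current
    return clusters


def trailing_mean(samples: Sequence[float], width: int) -> List[float]:
    """Average each sample with up to width - 1 samples that precede it."""
    smoothed = []
    for i in range(len(samples)):
        chunk = samples[max(0, i - width + 1):i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed
```

A joint analysis in this spirit would then look up the smoothed workload value just before the first record of each cluster, which is one way to relate failures to prior activity.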


REFERENCES


 
1
ARSENAULT, J. E., AND ROBERTS, J.A. Reliability and Maintainability of Electronic Systems. Computer Science Press, Potomac, Md., 1980.
 
2
 
3
BLACKBURN, D. L., AND OETTINGER, F.F. Transient Thermal Response of Power Transistors. In IEEE PESC Conference Record (June 10-12). 1974, Murray Hill, N.J., pp. 140-148.
 
4
BUTNER, S. E., AND IYER, R.K. A statistical study of reliability and system load at SLAC. In Digest, 10th International Symposium on Fault-Tolerant Computing (Kyoto, Japan, Oct. 1-3). IEEE Computer Society Press, 1980.
 
5
CASTILLO, X., AND SIEWIOREK, D.P. Workload, performance and reliability of digital computing systems. In Digest, 11th International Symposium on Fault-Tolerant Computing (Portland, Maine, June 24-26). IEEE Computer Society Press, 1981, pp. 84-89.
 
6
CASTILLO, X., AND SIEWIOREK, D. P. A workload dependent software reliability prediction model. In Digest, 12th International Symposium on Fault-Tolerant Computing (Santa Monica, Calif., June 22-24). IEEE Computer Society Press, 1982.
 
7
FELLER, W. An Introduction to Probability Theory and Its Applications. Wiley, New York, 1968.
 
8
GUNTHER, N. L., AND CARTER, W.C. Remarks on the probability of detecting faults. In Digest, 10th International Symposium on Fault-Tolerant Computing (Kyoto, Japan, Oct. 1-3). IEEE Computer Society Press, 1980, pp. 213-215.
 
9
IBM Corporation. OS/VS System management facilities (SMF). Order No. GC35-0004, IBM Corporation, Poughkeepsie, N.Y., 1973.
 
10
IBM Corporation. OS/VS, DOS/VSE, VM/370 environmental recording editing and printing (EREP) Program. Order No. GC28-0772, IBM Corporation, Poughkeepsie, N.Y., 1979.
 
11
IBM Corporation. OS/VS2 MVS System Programming Library: SYS1.LOGREC Error Recording. Order No. GC28-0677-5, IBM Corporation, Poughkeepsie, N.Y., 1982.
 
12
IBM Corporation. IBM System/370 Principles of Operation. Order No. GA22-7000-8, IBM Corporation, Poughkeepsie, N.Y., 1981.
 
13
IVALO, V. E.S. Pulse rating charts for the loadability of semiconductor devices. Electron. Appl. 22, 4 (1962), 148-162.
 
14
IYER, R. K., BUTNER, S. E., AND MCCLUSKEY, E. J. A statistical failure/load relationship; results of a multi-computer study. IEEE Trans. Comput. C-31, 7 (July 1982), 697-706.
 
15
IYER, R. K., AND ROSSETTI, D.J. A statistical load dependency of CPU errors at SLAC. In Digest, 12th International Symposium on Fault Tolerant Computing (Santa Monica, Calif., June). IEEE Computer Society Press, 1982, pp. 363-372.
 
16
KUJOWSKI, G. F., AND RYPKA, E.A. Effects of on-off cycling on equipment reliability. In 1978 Reliability and Maintainability Symposium (Los Angeles, Jan. 17-19). IEEE Computer Society Press, 1978, pp. 225-230.
 
17
LAPRIE, J.C. Dependable computing and fault tolerance: Concepts and terminology. In Proceedings of IEEE 15th International Symposium on Fault-Tolerant Computing (Ann Arbor, Mich., June 19-21). IEEE Computer Society Press, 1985, pp. 2-11.
 
18
ROSSETTI, D. J., AND IYER, R.K. A software system for reliability and workload analysis. CRC Tech. Rep. 81-18, Center for Reliable Computing, Computer Systems Laboratory, Stanford Univ., Stanford, Calif., Dec., 1981.
 
19
ROSSETTI, D. J., AND IYER, R.K. Software related failures on IBM 3081: A relationship with system utilization. In Proceedings of COMPSAC 82 (Chicago, Ill., Nov. 8-12). IEEE Computer Society Press, 1982.
 
20
SAS Institute Incorporated. SAS User's Guide, 1979 Edition, SAS Institute Incorporated, Cary, N.C., 1979.
 
21
SHOOMAN, M.L. Probabilistic Reliability: An Engineering Approach. McGraw Hill, New York, 1968.
 
22
SHURMAN, M.B. Time dependent failure rates for jet aircraft. In 1978 Reliability and Maintainability Symposium (Los Angeles, Jan. 17-19). IEEE Computer Society Press, 1978, pp. 198-201.



REVIEW

Reviewer: William Michael McCormack

This paper addresses the following critical question for system design: As the activity in a computer system increases, does the risk of failure increase faster than the increase in activity? Although no strict “cause and effect” rel...
