A Bayesian approach to fault classification
Source
Joint International Conference on Measurement and Modeling of Computer Systems
Proceedings of the 1990 ACM SIGMETRICS conference on Measurement and modeling of computer systems
Univ. of Colorado, Boulder, Colorado, United States
Pages: 58 - 66
Year of Publication: 1990
ISBN: 0-89791-359-0
Authors
Tein-Hsiang Lin
Department of Electrical and Computer Engineering, State University of New York at Buffalo, Buffalo, New York
Kang G. Shin
Real-Time Computing Laboratory, Department of Electrical Engineering and Computer Science, The University of Michigan, Ann Arbor, Michigan
ABSTRACT
According to their temporal behavior, faults in computer systems are classified into permanent, intermittent, and transient faults. Since it is impossible to identify the type of a fault upon its first detection, the common practice is to retry the failed instruction one or more times and then use other fault recovery methods, such as rollback or restart, if the retry is not successful. To determine an “optimal” (in some sense) number of retries, we need to know several fault parameters, which can be estimated only after classifying all the faults detected in the past.
In this paper we propose a new fault classification scheme that assigns a fault type to each detected fault based on its detection time, the outcome of retry, and its detection symptom. The classification procedure uses Bayesian decision theory to sequentially update the estimates of the fault parameters whenever a detected fault is classified. An important advantage of this classification is the early identification of the presence of an intermittent fault, so that appropriate measures can be taken before it causes serious damage to the system. To assess the quality of the proposed scheme, the probability of incorrect classification is also analyzed and compared with simulation results.
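The idea of classifying each fault and then updating the estimates can be illustrated with a minimal sketch. This is not the procedure from the paper, whose details are not given in the abstract. It assumes a much simpler model: the only observation is whether the retry succeeded (detection time and symptom are ignored), the prior over fault types is kept as pseudo-counts, and the likelihood values in `P_RETRY_OK` are made-up numbers chosen for illustration. The class name `SequentialFaultClassifier` and its methods are likewise hypothetical.

```python
from dataclasses import dataclass, field

FAULT_TYPES = ("permanent", "intermittent", "transient")

# Hypothetical values of P(retry succeeds | fault type); illustrative only,
# not taken from the paper.
P_RETRY_OK = {"permanent": 0.0, "intermittent": 0.5, "transient": 0.95}


@dataclass
class SequentialFaultClassifier:
    # Pseudo-counts acting as the prior over fault types (assumed uniform).
    counts: dict = field(default_factory=lambda: {t: 1.0 for t in FAULT_TYPES})

    def posterior(self, retry_ok: bool) -> dict:
        """Bayes' rule: posterior is proportional to prior times likelihood."""
        total = sum(self.counts.values())
        unnorm = {}
        for t in FAULT_TYPES:
            prior = self.counts[t] / total
            like = P_RETRY_OK[t] if retry_ok else 1.0 - P_RETRY_OK[t]
            unnorm[t] = prior * like
        z = sum(unnorm.values())
        return {t: v / z for t, v in unnorm.items()}

    def classify_and_update(self, retry_ok: bool) -> str:
        """Pick the most probable type, then fold the decision into the prior."""
        post = self.posterior(retry_ok)
        label = max(post, key=post.get)  # MAP decision minimizes 0-1 loss
        self.counts[label] += 1.0        # sequential update of the estimate
        return label


clf = SequentialFaultClassifier()
first = clf.classify_and_update(retry_ok=True)    # retry succeeded
second = clf.classify_and_update(retry_ok=False)  # retry failed
print(first, second)
```

Choosing the type with the highest posterior is the Bayes decision under a 0-1 loss. A different loss, for example one that penalizes a missed intermittent fault more heavily, would change the decision rule but not the update step.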