|
Warning: The download time has expired please click on the item to try again.
ABSTRACT
Fault tolerance in distributed computing is a wide area with a significant body of literature that is vastly diverse in methodology and terminology. This paper aims at structuring the area and thus guiding readers into this interesting field. We use a formal approach to define important terms like fault, fault tolerance, and redundancy. This leads to four distinct forms of fault tolerance and to two main phases in achieving them: detection and correction. We show that this can help to reveal inherently fundamental structures that contribute to understanding and unifying methods and terminology. By doing this, we survey many existing methodologies and discuss their relations. The underlying system model is the close-to-reality asynchronous message-passing model of distributed computing.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
|
| |
3
|
|
| |
4
|
|
| |
5
|
|
| |
6
|
ALMEIDA, C., VERISSIMO, P., AND CASIMIRO, A. 1998. The quasi-synchronous approach to fault-tolerant and real-time communication and processing. Technical Report CTI RT- 98-04. Instituto Superior T~cnico, Lisboa, Portugal.
|
| |
7
|
ALPERN, B. AND SCHNEIDER, F. B. 1985. Defining liveness. Inf. Process. Lett. 21, 4 (Oct.), 181- 185.
|
| |
8
|
|
| |
9
|
|
| |
10
|
|
| |
11
|
|
| |
12
|
|
| |
13
|
|
| |
14
|
|
| |
15
|
ATTIYA, H., BAR-NOY, A., DOLEV, D., KOLLER, D., PELEG, D., AND REISCHUK, R. 1987. Achievable cases in an asynchronous environment. In Proceedings of the 28th Annual Symposium on Foundations of Computer Science (Oct. 1987). IEEE Computer Society Press, Los Alamitos, CA, 337-346.
|
| |
16
|
AVIZIENIS, A. 1976. Fault-tolerant systems. IEEE Trans. Comput. 25, 12 (Dec.), 1304- 1312.
|
| |
17
|
|
| |
18
|
|
| |
19
|
|
 |
20
|
|
| |
21
|
BARTLETT, J. F. 1978. A "NonStop" operating system. In Proceedings of the 11th Hawaii International Conference on System Sciences. IEEE Computer Society Press, Los Alamitos, CA.
|
| |
22
|
|
| |
23
|
|
| |
24
|
|
 |
25
|
|
 |
26
|
|
| |
27
|
|
 |
28
|
|
 |
29
|
Tushar Deepak Chandra , Vassos Hadzilacos , Sam Toueg , Bernadette Charron-Bost, On the impossibility of group membership, Proceedings of the fifteenth annual ACM symposium on Principles of distributed computing, p.322-330, May 23-26, 1996, Philadelphia, Pennsylvania, United States
[doi> 10.1145/248052.248120]
|
 |
30
|
|
| |
31
|
|
 |
32
|
|
| |
33
|
|
| |
34
|
|
 |
35
|
|
| |
36
|
CRISTIAN, F. 1985. A rigorous approach to faulttolerant programming. IEEE Trans. Softw. Eng. 11, 1 (Jan.), 23-31.
|
 |
37
|
|
| |
38
|
|
| |
39
|
|
 |
40
|
|
 |
41
|
|
 |
42
|
|
| |
43
|
|
 |
44
|
|
| |
45
|
DOUDOU, A. AND SCHIPER, n. 1997. Muteness detectors for consensus with Byzantine processes. Technical report, EPFL. Ecole Polytechnique Federale de Lausanne, Lausanne, Switzerland.
|
 |
46
|
|
 |
47
|
|
| |
48
|
|
| |
49
|
GARG, V. K. 1997. Observation and control for debugging distributed computations. In 3rd Int. Workshop on Automated Debugging (AADEBUG 97, Link~ping, Sweden, May). 1-12.
|
| |
50
|
|
| |
51
|
|
| |
52
|
|
| |
53
|
GARTNER, F. C. 1998. Specifications for fault tolerance: A comedy of failures. Technical Report TUD-BS-1998-03. Darmstadt University of Technology, Darmstadt, Germany.
|
| |
54
|
GARTNER, F. C. AND PAGNIA, H. 1998. Enhancing the fault tolerance of replication: another excercise in constrained convergence. In Digest of Fast Abstracts of the 28th IEEE Symposium on Fault Tolerant Computing Systems (FTCS- 28, June). IEEE Computer Society Press, Los Alamitos, CA.
|
| |
55
|
|
| |
56
|
|
| |
57
|
|
| |
58
|
|
 |
59
|
|
| |
60
|
|
| |
61
|
|
| |
62
|
HURFIN, M., MOSTEFAOUI, A., AND RAYNAL, M. 1997. Consensus in asynchronous systems where processes can crash and recover. Technical Report 1144. IRISA, Rennes Cedex, France.
|
| |
63
|
|
| |
64
|
|
| |
65
|
|
| |
66
|
LAMPORT, L. 1977. Proving the correctness of multiprocess programs. IEEE Trans. Softw. Eng. 3, 2 (Mar.), 125-143.
|
 |
67
|
|
| |
68
|
|
 |
69
|
|
| |
70
|
LAPRIE, J.C. 1985. Dependable computing and fault tolerance: concepts and terminology. In Proceedings of the 15th IEEE Symposium on Fault Tolerant Computing Systems (FTCS-15, June 1985). IEEE Computer Society Press, Los Alamitos, CA, 2-11.
|
| |
71
|
|
| |
72
|
LONG, D. D. E., CARROLL, J. L., AND PARK, C. J. 1991. A study of the reliability of Internet sites. In Proceedings of the l Oth IEEE Symposium on Reliable Distributed Systems (Pisa, Italy, Sept.). IEEE Press, Piscataway, NJ, 177-186.
|
| |
73
|
|
| |
74
|
|
| |
75
|
|
| |
76
|
|
| |
77
|
OLIVEIRA, R., GUERRAOUI, R., AND SCHIPER, A. 1997. Consensus in the crash-recover model. Ecole Polytechnique Federale de Lausanne, Lausanne, Switzerland. Technical Report TR-97/239
|
 |
78
|
David A. Patterson , Garth Gibson , Randy H. Katz, A case for redundant arrays of inexpensive disks (RAID), Proceedings of the 1988 ACM SIGMOD international conference on Management of data, p.109-116, June 01-03, 1988, Chicago, Illinois, United States
|
| |
79
|
RICCIARDI, A., SCHIPER, A., AND BIRMAN, K. 1993. Understanding partitions and the "no partition" assumption. In Proceedings of the 4th Workshop on Future Trends of Distributed Computing Systems (FTDCS-4).
|
| |
80
|
|
| |
82
|
SCHIPER, n. AND RICCARDI, n. 1993. Virtuallysynchronous communication based on a weak failure suspector. In Proceedings of the 23rd IEEE Symposium on Fault Tolerant Computing Systems (FTCS-23). IEEE Computer Society Press, Los Alamitos, CA, 534-543.
|
| |
83
|
|
 |
84
|
|
 |
85
|
|
| |
86
|
|
 |
87
|
|
| |
88
|
|
| |
89
|
|
| |
90
|
|
 |
91
|
|
| |
92
|
|
| |
93
|
|
| |
94
|
|
| |
95
|
THEEL, O. AND GARTNER, F.C. 1998. On proving the stability of distributed algorithms: selfstabilization vs. control theory. In Proceedings of the International Computers Conference on Systems, Signals, Control (SSCC'98, Durban, South Africa, Sept.), V. B. Bajic, Ed. 58-66.
|
| |
96
|
|
| |
97
|
|
CITED BY 21
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Yingqian Zhang , Efrat Manisterski , Sarit Kraus , V. S. Subrahmanian , David Peleg, Computing the fault tolerance of multi-agent deployment, Artificial Intelligence, v.173 n.3-4, p.437-465, March, 2009
|
|
|
|
|
|
|
|
|
|
|
|
Maxim D. Peysakhov , Robert N. Lass , William C. Regli , Moshe Kam, An ecological approach to agent population management, Proceedings of the 20th national conference on Artificial intelligence, p.146-151, July 09-13, 2005, Pittsburgh, Pennsylvania
|
|
|
|
|
|
|
|
|
|
INDEX TERMS
Primary Classification:
C.
Computer Systems Organization
C.4
PERFORMANCE OF SYSTEMS
Subjects:
Modeling techniques
Additional Classification:
A.
General Literature
C.
Computer Systems Organization
C.4
PERFORMANCE OF SYSTEMS
Subjects:
Reliability, availability, and serviceability
General Terms:
Algorithms,
Design,
Reliability,
Theory
Keywords:
agreement problem,
asynchronous system,
consensus problem,
failure correction,
failure detection,
fault models,
fault tolerance,
liveness,
message passing,
possibility detection,
predicate detection,
redundancy,
safety
REVIEW
"Florin Popentiu : Reviewer"
Ga¨rtner surveys many of the existing methodologies
in fault-tolerant distributed computing, and discusses their
relationships. The purpose of this paper is twofold. First, it
structures the area clearly, using the latest resea
more...
|