ACM Home Page
Please provide us with feedback. Feedback
Digital Library logoTake a look at the new version of this page: [ beta version ]. Tell us what you think.
Fundamentals of fault-tolerant distributed computing in asynchronous environments
Full text PdfPdf (204 KB)
Source ACM Computing Surveys (CSUR) archive
Volume 31 ,  Issue 1  (March 1999) table of contents
Pages: 1 - 26  
Year of Publication: 1999
ISSN:0360-0300
Author
Felix C. Gärtner  Darmstadt Univ. of Technology, Darmstadt, Germany
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 48,   Downloads (12 Months): 370,   Citation Count: 21
Additional Information:

abstract   references   cited by   index terms   review   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/311531.311532
What is a DOI?

Warning: The download time has expired please click on the item to try again.


ABSTRACT

Fault tolerance in distributed computing is a wide area with a significant body of literature that is vastly diverse in methodology and terminology. This paper aims at structuring the area and thus guiding readers into this interesting field. We use a formal approach to define important terms like fault, fault tolerance, and redundancy. This leads to four distinct forms of fault tolerance and to two main phases in achieving them: detection and correction. We show that this can help to reveal inherently fundamental structures that contribute to understanding and unifying methods and terminology. By doing this, we survey many existing methodologies and discuss their relations. The underlying system model is the close-to-reality asynchronous message-passing model of distributed computing.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
 
2
 
3
 
4
 
5
 
6
ALMEIDA, C., VERISSIMO, P., AND CASIMIRO, A. 1998. The quasi-synchronous approach to fault-tolerant and real-time communication and processing. Technical Report CTI RT- 98-04. Instituto Superior T~cnico, Lisboa, Portugal.
 
7
ALPERN, B. AND SCHNEIDER, F. B. 1985. Defining liveness. Inf. Process. Lett. 21, 4 (Oct.), 181- 185.
 
8
 
9
 
10
 
11
 
12
 
13
 
14
 
15
ATTIYA, H., BAR-NOY, A., DOLEV, D., KOLLER, D., PELEG, D., AND REISCHUK, R. 1987. Achievable cases in an asynchronous environment. In Proceedings of the 28th Annual Symposium on Foundations of Computer Science (Oct. 1987). IEEE Computer Society Press, Los Alamitos, CA, 337-346.
 
16
AVIZIENIS, A. 1976. Fault-tolerant systems. IEEE Trans. Comput. 25, 12 (Dec.), 1304- 1312.
 
17
 
18
 
19
20
 
21
BARTLETT, J. F. 1978. A "NonStop" operating system. In Proceedings of the 11th Hawaii International Conference on System Sciences. IEEE Computer Society Press, Los Alamitos, CA.
 
22
 
23
 
24
25
26
 
27
28
29
30
 
31
32
 
33
 
34
35
 
36
CRISTIAN, F. 1985. A rigorous approach to faulttolerant programming. IEEE Trans. Softw. Eng. 11, 1 (Jan.), 23-31.
37
 
38
 
39
40
41
42
 
43
44
 
45
DOUDOU, A. AND SCHIPER, n. 1997. Muteness detectors for consensus with Byzantine processes. Technical report, EPFL. Ecole Polytechnique Federale de Lausanne, Lausanne, Switzerland.
46
47
 
48
 
49
GARG, V. K. 1997. Observation and control for debugging distributed computations. In 3rd Int. Workshop on Automated Debugging (AADEBUG 97, Link~ping, Sweden, May). 1-12.
 
50
 
51
 
52
 
53
GARTNER, F. C. 1998. Specifications for fault tolerance: A comedy of failures. Technical Report TUD-BS-1998-03. Darmstadt University of Technology, Darmstadt, Germany.
 
54
GARTNER, F. C. AND PAGNIA, H. 1998. Enhancing the fault tolerance of replication: another excercise in constrained convergence. In Digest of Fast Abstracts of the 28th IEEE Symposium on Fault Tolerant Computing Systems (FTCS- 28, June). IEEE Computer Society Press, Los Alamitos, CA.
 
55
 
56
 
57
 
58
59
 
60
 
61
 
62
HURFIN, M., MOSTEFAOUI, A., AND RAYNAL, M. 1997. Consensus in asynchronous systems where processes can crash and recover. Technical Report 1144. IRISA, Rennes Cedex, France.
 
63
 
64
 
65
 
66
LAMPORT, L. 1977. Proving the correctness of multiprocess programs. IEEE Trans. Softw. Eng. 3, 2 (Mar.), 125-143.
67
 
68
69
 
70
LAPRIE, J.C. 1985. Dependable computing and fault tolerance: concepts and terminology. In Proceedings of the 15th IEEE Symposium on Fault Tolerant Computing Systems (FTCS-15, June 1985). IEEE Computer Society Press, Los Alamitos, CA, 2-11.
 
71
 
72
LONG, D. D. E., CARROLL, J. L., AND PARK, C. J. 1991. A study of the reliability of Internet sites. In Proceedings of the l Oth IEEE Symposium on Reliable Distributed Systems (Pisa, Italy, Sept.). IEEE Press, Piscataway, NJ, 177-186.
 
73
 
74
 
75
 
76
 
77
OLIVEIRA, R., GUERRAOUI, R., AND SCHIPER, A. 1997. Consensus in the crash-recover model. Ecole Polytechnique Federale de Lausanne, Lausanne, Switzerland. Technical Report TR-97/239
78
 
79
RICCIARDI, A., SCHIPER, A., AND BIRMAN, K. 1993. Understanding partitions and the "no partition" assumption. In Proceedings of the 4th Workshop on Future Trends of Distributed Computing Systems (FTDCS-4).
 
80
 
82
SCHIPER, n. AND RICCARDI, n. 1993. Virtuallysynchronous communication based on a weak failure suspector. In Proceedings of the 23rd IEEE Symposium on Fault Tolerant Computing Systems (FTCS-23). IEEE Computer Society Press, Los Alamitos, CA, 534-543.
 
83
84
85
 
86
87
 
88
 
89
 
90
91
 
92
 
93
 
94
 
95
THEEL, O. AND GARTNER, F.C. 1998. On proving the stability of distributed algorithms: selfstabilization vs. control theory. In Proceedings of the International Computers Conference on Systems, Signals, Control (SSCC'98, Durban, South Africa, Sept.), V. B. Bajic, Ed. 58-66.
 
96
 
97

CITED BY  21


REVIEW

"Florin Popentiu : Reviewer"

Ga¨rtner surveys many of the existing methodologies in fault-tolerant distributed computing, and discusses their relationships. The purpose of this paper is twofold. First, it structures the area clearly, using the latest resea  more...