ABSTRACT
Since the first version of Chandra and Toueg's seminal paper "Unreliable failure detectors for reliable distributed systems" appeared in 1991, the failure detector concept has been extensively studied. This is not surprising, as failure detection is pervasive in the design, analysis, and implementation of many fault-tolerant distributed algorithms that constitute the core of distributed system middleware. The literature on this topic is mostly technical and appears mainly in theoretically inclined journals and conferences. The aim of this paper is to offer an introductory survey of the failure detector concept for readers who are not familiar with it and want to quickly understand its aim, its basic principles, its power, and its limitations. To attain this goal, the paper first describes the motivations that underlie the concept, and then surveys several distributed computing problems, showing how they can be solved with the help of an appropriate failure detector. This short paper thus presents motivations, concepts, problems, definitions, and algorithms. It does not contain proofs. It is aimed at people who want to understand the basics of failure detectors.