A short introduction to failure detectors for asynchronous distributed systems
Source: ACM SIGACT News
Volume 36, Issue 1 (March 2005)
Column: Distributed computing
Pages: 53-70
Year of Publication: 2005
ISSN: 0163-5700
Author
Michel Raynal, IRISA, Rennes Cedex, France
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1052796.1052806

ABSTRACT

Since the first version of Chandra and Toueg's seminal paper "Unreliable failure detectors for reliable distributed systems" appeared in 1991, the failure detector concept has been extensively studied. This is not surprising, as failure detection is pervasive in the design, analysis, and implementation of many fault-tolerant distributed algorithms that constitute the core of distributed system middleware. The literature on this topic is mostly technical and appears mainly in theoretically inclined journals and conferences. The aim of this paper is to offer an introductory survey of the failure detector concept for readers who are not familiar with it and want to quickly understand its aim, its basic principles, its power, and its limitations. To attain this goal, the paper first describes the motivations that underlie the concept, and then surveys several distributed computing problems, showing how they can be solved with the help of an appropriate failure detector. This short paper thus presents motivations, concepts, problems, definitions, and algorithms. It does not contain proofs. It is aimed at people who want to understand the basics of failure detectors.

