ACM Home Page
Please provide us with feedback. Feedback
Unreliable failure detectors for reliable distributed systems
Full text PdfPdf (3.16 MB)
Source Journal of the ACM (JACM) archive
Volume 43 ,  Issue 2  (March 1996) table of contents
Pages: 225 - 267  
Year of Publication: 1996
ISSN:0004-5411
Authors
Tushar Deepak Chandra  IBM, Hawthorne, NY
Sam Toueg  Cornell Univ., Ithaca, NY
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 69,   Downloads (12 Months): 393,   Citation Count: 232
Additional Information:

abstract   references   cited by   index terms   review   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/226643.226647
What is a DOI?

ABSTRACT

We introduce the concept of unreliable failure detectors and study how they can be used to solve Consensus in asynchronous systems with crash failures. We characterise unreliable failure detectors in terms of two properties—completeness and accuracy. We show that Consensus can be solved even with unreliable failure detectors that make an infinite number of mistakes, and determine which ones can be used to solve Consensus despite any number of crashes, and which ones require a majority of correct processes. We prove that Consensus and Atomic Broadcast are reducible to each other in asynchronous systems with crash failures; thus, the above results also apply to Atomic Broadcast. A companion paper shows that one of the failure detectors introduced here is the weakest failure detector for solving Consensus [Chandra et al. 1992].


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
AMIR, ~., DOLEV, D., KRAMER, S., AND MALKI, D. 1991. Transis: A communication sub-system for high availability. Tech. Rep. CS91-13 (Nov.), Computer Science Department, The Hebrew University of Jerusalem, Jerusalem, Israel.
 
2
ATrlYA~ H., B^R-NoY, A., DOLEV, D., KOLLER, D., PELEG, D., AND REISCHUK, R. 1987. Achievable cases in an asynchronous environment. In Proceedings of the 28th Symposium on Foundations of Computer Science (Oct.). IEEE Computer Society Press, Washington, D.C., pp. 337-346.
3
4
 
5
BERMAN, P., GARAY, J. A., AND PERRY, K.J. 1989. Towards optimal distributed consensus. In Proceedings of the 30th Symposium on Foundations of Computer Science (Oct.). IEEE Computer Society Press, Washington, D.C., pp. 410-415.
6
 
7
BIRMAN, K. P., COOPER, R., JOSEPH, T. A., KANE, K. P., AND SCHMUCK, F. B. 1990. lsis--A Distributed Programming Environment.
8
9
10
 
11
12
 
13
 
14
CHANDRA, T. D., AND LARREA, M. 1994. E-mail correspondence. Showed that cW cannot be used to solve non-blocking atomic commit.
 
15
16
 
17
CHOR, B., AND DWORK, C. 1989. Randomization in byzantine agreement. Adv. Comput. Res. 5, 443-497.
 
18
CRISTIAN, F. 1987. Issues in the design of highly available computing services. In Annual Symposium of the Canadian Information Processing Society (July), pp. 9-16. Also IBM Res. Rep. RJ5856. Thomas J. Watson Research Center, Hawthorne, N.Y.
 
19
CRISTIAN, F., AGHILI, H., STRONG, R., AND DOLEV, D. 1985/1989. Atomic broadcast: From simple message diffusion to Byzantine agreement. In Proceedings of the 15th International Symposium on Fauh-Tolerant Computing (June 1985), pp. 200-206. A revised version appears as IBM Research Laboratory Technical Report RJ5244 (April 1989). Thomas J. Watson Research Center, Hawthorne, N.Y.
 
20
CRISTIAN, F., DANCEY, R. D., AND DEHN, J. 1990. Fault-tolerance in the advanced automation system. Tech. Rep. RJ 7424 (April), IBM Research Laboratory, Thomas J. Watson Research Center, Hawthorne, N.Y.
21
22
23
 
24
FISCHER, M.J. 1983. The consensus problem in unreliable distributed systems (a brief survey). Tech. Rep. 273 (June), Department of Computer Science, Yale University, New Haven, Conn.
25
26
 
27
 
28
 
29
30
 
31
LAMPORT, L. 1978. The implementation of reliable distributed multiprocess systems. Comput. Netw. 2, 95-114.
32
 
33
 
34
Loul, M., AND ABU-AMARA. 1987. Memory requirements for agreement among unreliable asynchronous processes. Adv. Comput. Res. 4, 163-183.
 
35
MOSES, Y., DOLEV, D., AND HALPERN, J. Y. 1986. Cheating husbands and other stories: a case study of knowledge, action, and communication. Distrib. Comput. 1, 3, 167-176.
 
36
MULLENDER, S. J., ED. 1987. The Amoeba Distributed Operating System. Selected papers 1984-1987. Centre for Mathematics and Computer Science.
37
 
38
39
40
41
 
42
 
43
REISCHUK, R. 1982. A new solution for the Byzantine general's problem. Tech. Rep. ILl 3673 (Nov.), IBM Research Laboratory, Thomas J. Watson Research Center, Hawthorne, N.Y.
44
 
45
SABEL, L., AND MARZULLO, K. 1995. Election vs. consensus in asynchronous systems. Tech. Rep. TR95-411 (Feb.). Univ. California at San Diego. San Diego, Calif. Available at ftp://ftp.cs. cornell.edu/pub/sabel/tr94-1413.ps.
46
 
47
WENSLEY, J. H., LAMPORT, L., GOLDBERG, J., GREEN, M. W., LEVITT, K. N., MELLIAR-SMITH, P., SHOSTAK, R. E., AND WEINSTOCK, C. B. 1978. SIFT: Design and analysis of a fault-tolerant computer for aircraft control. Proc. IEEE 66, 10 (Oct.), 1240-1255.

CITED BY  232


REVIEW

"Christopher D. Carothers : Reviewer"

In distributed systems, the tractability of computations has been a question of much interest and the subject of much research. The consensus and atomic broadcast problems are of particular interest. Previous work by Dolev et al. more...

Collaborative Colleagues:
Tushar Deepak Chandra: colleagues
Sam Toueg: colleagues