|
ABSTRACT
We introduce the concept of unreliable failure detectors and study how they can be used to solve Consensus in asynchronous systems with crash failures. We characterise unreliable failure detectors in terms of two properties—completeness and accuracy. We show that Consensus can be solved even with unreliable failure detectors that make an infinite number of mistakes, and determine which ones can be used to solve Consensus despite any number of crashes, and which ones require a majority of correct processes. We prove that Consensus and Atomic Broadcast are reducible to each other in asynchronous systems with crash failures; thus, the above results also apply to Atomic Broadcast. A companion paper shows that one of the failure detectors introduced here is the weakest failure detector for solving Consensus [Chandra et al. 1992].
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
AMIR, ~., DOLEV, D., KRAMER, S., AND MALKI, D. 1991. Transis: A communication sub-system for high availability. Tech. Rep. CS91-13 (Nov.), Computer Science Department, The Hebrew University of Jerusalem, Jerusalem, Israel.
|
| |
2
|
ATrlYA~ H., B^R-NoY, A., DOLEV, D., KOLLER, D., PELEG, D., AND REISCHUK, R. 1987. Achievable cases in an asynchronous environment. In Proceedings of the 28th Symposium on Foundations of Computer Science (Oct.). IEEE Computer Society Press, Washington, D.C., pp. 337-346.
|
 |
3
|
Hagit Attiya , Cynthia Dwork , Nancy Lynch , Larry Stockmeyer, Bounds on the time to reach agreement in the presence of timing uncertainty, Proceedings of the twenty-third annual ACM symposium on Theory of computing, p.359-369, May 05-08, 1991, New Orleans, Louisiana, United States
[doi> 10.1145/103418.103457]
|
 |
4
|
|
| |
5
|
BERMAN, P., GARAY, J. A., AND PERRY, K.J. 1989. Towards optimal distributed consensus. In Proceedings of the 30th Symposium on Foundations of Computer Science (Oct.). IEEE Computer Society Press, Washington, D.C., pp. 410-415.
|
 |
6
|
Ofer Biran , Shlomo Moran , Shmuel Zaks, A combinatorial characterization of the distributed tasks which are solvable in the presence of one faulty processor, Proceedings of the seventh annual ACM Symposium on Principles of distributed computing, p.263-275, August 15-17, 1988, Toronto, Ontario, Canada
[doi> 10.1145/62546.62590]
|
| |
7
|
BIRMAN, K. P., COOPER, R., JOSEPH, T. A., KANE, K. P., AND SCHMUCK, F. B. 1990. lsis--A Distributed Programming Environment.
|
 |
8
|
|
 |
9
|
|
 |
10
|
|
| |
11
|
|
 |
12
|
Tushar Deepak Chandra , Vassos Hadzilacos , Sam Toueg, The weakest failure detector for solving consensus, Proceedings of the eleventh annual ACM symposium on Principles of distributed computing, p.147-158, August 10-12, 1992, Vancouver, British Columbia, Canada
[doi> 10.1145/135419.135451]
|
| |
13
|
|
| |
14
|
CHANDRA, T. D., AND LARREA, M. 1994. E-mail correspondence. Showed that cW cannot be used to solve non-blocking atomic commit.
|
| |
15
|
|
 |
16
|
|
| |
17
|
CHOR, B., AND DWORK, C. 1989. Randomization in byzantine agreement. Adv. Comput. Res. 5, 443-497.
|
| |
18
|
CRISTIAN, F. 1987. Issues in the design of highly available computing services. In Annual Symposium of the Canadian Information Processing Society (July), pp. 9-16. Also IBM Res. Rep. RJ5856. Thomas J. Watson Research Center, Hawthorne, N.Y.
|
| |
19
|
CRISTIAN, F., AGHILI, H., STRONG, R., AND DOLEV, D. 1985/1989. Atomic broadcast: From simple message diffusion to Byzantine agreement. In Proceedings of the 15th International Symposium on Fauh-Tolerant Computing (June 1985), pp. 200-206. A revised version appears as IBM Research Laboratory Technical Report RJ5244 (April 1989). Thomas J. Watson Research Center, Hawthorne, N.Y.
|
| |
20
|
CRISTIAN, F., DANCEY, R. D., AND DEHN, J. 1990. Fault-tolerance in the advanced automation system. Tech. Rep. RJ 7424 (April), IBM Research Laboratory, Thomas J. Watson Research Center, Hawthorne, N.Y.
|
 |
21
|
|
 |
22
|
|
 |
23
|
|
| |
24
|
FISCHER, M.J. 1983. The consensus problem in unreliable distributed systems (a brief survey). Tech. Rep. 273 (June), Department of Computer Science, Yale University, New Haven, Conn.
|
 |
25
|
|
 |
26
|
Ajei Gopal , Ray Strong , Sam Toueg , Flaviu Cristian, Early-delivery atomic broadcast, Proceedings of the ninth annual ACM symposium on Principles of distributed computing, p.297-309, August 22-24, 1990, Quebec City, Quebec, Canada
[doi> 10.1145/93385.93430]
|
| |
27
|
|
| |
28
|
|
| |
29
|
|
 |
30
|
|
| |
31
|
LAMPORT, L. 1978. The implementation of reliable distributed multiprocess systems. Comput. Netw. 2, 95-114.
|
 |
32
|
|
| |
33
|
|
| |
34
|
Loul, M., AND ABU-AMARA. 1987. Memory requirements for agreement among unreliable asynchronous processes. Adv. Comput. Res. 4, 163-183.
|
| |
35
|
MOSES, Y., DOLEV, D., AND HALPERN, J. Y. 1986. Cheating husbands and other stories: a case study of knowledge, action, and communication. Distrib. Comput. 1, 3, 167-176.
|
| |
36
|
MULLENDER, S. J., ED. 1987. The Amoeba Distributed Operating System. Selected papers 1984-1987. Centre for Mathematics and Computer Science.
|
 |
37
|
|
| |
38
|
|
 |
39
|
|
 |
40
|
|
 |
41
|
|
| |
42
|
|
| |
43
|
REISCHUK, R. 1982. A new solution for the Byzantine general's problem. Tech. Rep. ILl 3673 (Nov.), IBM Research Laboratory, Thomas J. Watson Research Center, Hawthorne, N.Y.
|
 |
44
|
|
| |
45
|
SABEL, L., AND MARZULLO, K. 1995. Election vs. consensus in asynchronous systems. Tech. Rep. TR95-411 (Feb.). Univ. California at San Diego. San Diego, Calif. Available at ftp://ftp.cs. cornell.edu/pub/sabel/tr94-1413.ps.
|
 |
46
|
|
| |
47
|
WENSLEY, J. H., LAMPORT, L., GOLDBERG, J., GREEN, M. W., LEVITT, K. N., MELLIAR-SMITH, P., SHOSTAK, R. E., AND WEINSTOCK, C. B. 1978. SIFT: Design and analysis of a fault-tolerant computer for aircraft control. Proc. IEEE 66, 10 (Oct.), 1240-1255.
|
CITED BY 232
|
|
|
|
|
Fabíola Greve , Michel Hurfin , Raimundo Macêdo , Michel Raynal, Time and message-efficient S-based consensus (brief announcement), Proceedings of the nineteenth annual ACM symposium on Principles of distributed computing, p.332, July 16-19, 2000, Portland, Oregon, United States
|
|
|
|
|
|
|
|
|
|
|
|
Vasken Bohossian , Chenggong C. Fan , Paul S. LeMahieu , Marc D. Riedel , Jehoshua Bruck , Lihao Xu, Computing in the RAIN: A Reliable Array of Independent Nodes, IEEE Transactions on Parallel and Distributed Systems, v.12 n.2, p.99-114, February 2001
|
|
|
Mikel Larrea , Antonio Fernández , Sergio Arévalo, Optimal implementation of the weakest failure detector for solving consensus (brief announcement), Proceedings of the nineteenth annual ACM symposium on Principles of distributed computing, p.334, July 16-19, 2000, Portland, Oregon, United States
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Danny Dolev , Idit Keidar , Esti Yeger Lotem, Dynamic voting for consistent primary components, Proceedings of the sixteenth annual ACM symposium on Principles of distributed computing, p.63-71, August 21-24, 1997, Santa Barbara, California, United States
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Christian Cachin , Klaus Kursawe , Victor Shoup, Random oracles in constantipole: practical asynchronous Byzantine agreement using cryptography (extended abstract), Proceedings of the nineteenth annual ACM symposium on Principles of distributed computing, p.123-132, July 16-19, 2000, Portland, Oregon, United States
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Marcos K. Aguilera , Carole Delporte-Gallet , Hugues Fauconnier , Sam Toueg, On implementing omega with weak reliability and synchrony assumptions, Proceedings of the twenty-second annual symposium on Principles of distributed computing, p.306-314, July 13-16, 2003, Boston, Massachusetts
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Achour Mostefaoui , Sergio Rajsbaum , Michel Raynal , Matthieu Roy, A hierarchy of conditions for consensus solvability, Proceedings of the twentieth annual ACM symposium on Principles of distributed computing, p.151-160, August 2001, Newport, Rhode Island, United States
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Marcos K. Aguilera , Carole Delporte-Gallet , Hugues Fauconnier , Sam Toueg, Communication-efficient leader election and consensus with limited link synchrony, Proceedings of the twenty-third annual ACM symposium on Principles of distributed computing, July 25-28, 2004, St. John's, Newfoundland, Canada
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Carole Delporte-Gallet , Hugues Fauconnier , Rachid Guerraoui , Vassos Hadzilacos , Petr Kouznetsov , Sam Toueg, The weakest failure detectors to solve certain fundamental problems in distributed computing, Proceedings of the twenty-third annual ACM symposium on Principles of distributed computing, July 25-28, 2004, St. John's, Newfoundland, Canada
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Gregory Chockler , Murat Demirbas , Seth Gilbert , Calvin Newport , Tina Nolte, Consensus and collision detectors in wireless Ad Hoc networks, Proceedings of the twenty-fourth annual ACM symposium on Principles of distributed computing, July 17-20, 2005, Las Vegas, NV, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Hagit Attiya , Rachid Guerraoui , Danny Hendler , Petr Kouznetsov, Synchronizing without locks is inherently expensive, Proceedings of the twenty-fifth annual ACM symposium on Principles of distributed computing, July 23-26, 2006, Denver, Colorado, USA
|
|
|
Idit Keidar , Alexander Shraer, Timeliness, failure-detectors, and consensus performance, Proceedings of the twenty-fifth annual ACM symposium on Principles of distributed computing, July 23-26, 2006, Denver, Colorado, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Achour Mostefaoui , Sergio Rajsbaum , Michel Raynal , Corentin Travers, Irreducibility and additivity of set agreement-oriented failure detector classes, Proceedings of the twenty-fifth annual ACM symposium on Principles of distributed computing, July 23-26, 2006, Denver, Colorado, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
John Dunagan , Nicholas J. A. Harvey , Michael B. Jones , Dejan Kostić , Marvin Theimer , Alec Wolman, FUSE: lightweight guaranteed distributed failure notification, Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation, p.11-11, December 06-08, 2004, San Francisco, CA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Andrei Korostelev , Johan Lukkien , Jan Nesvadba , Yuechen Qian, QoS management in distributed service oriented systems, Proceedings of the 25th conference on Proceedings of the 25th IASTED International Multi-Conference: parallel and distributed computing and networks, p.345-352, February 13-15, 2007, Innsbruck, Austria
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Samir Djilali , Thomas Herault , Oleg Lodygensky , Tangui Morlier , Gilles Fedak , Franck Cappello, RPC-V: Toward Fault-Tolerant RPC for Internet Connected Desktop Grids with Volatile Nodes, Proceedings of the 2004 ACM/IEEE conference on Supercomputing, p.39, November 06-12, 2004
|
|
|
Paul Stelling , Cheryl DeMatteis , Ian Foster , Carl Kesselman , Craig Lee , Gregor von Laszewski, A fault detection service for wide area distributed computations, Cluster Computing, v.2 n.2, p.117-128, 1999
|
|
|
Xuezheng Liu , Zhenyu Guo , Xi Wang , Feibo Chen , Xiaochen Lian , Jian Tang , Ming Wu , M. Frans Kaashoek , Zheng Zhang, D3S: debugging deployed distributed systems, Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation, p.423-437, April 16-18, 2008, San Francisco, California
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Tallat M. Shafaat , Thorsten Schütt , Monika Moser , Seif Haridi , Ali Ghodsi , Alexander Reinefeld, Key-based consistency and availability in structured overlay networks, Proceedings of the 17th international symposium on High performance distributed computing, June 23-27, 2008, Boston, MA, USA
|
|
|
|
|
|
|
|
|
|
|
|
Rachid Guerraoui , Maurice Herlihy , Petr Kouznetsov , Nancy Lynch , Calvin Newport, On the weakest failure detector ever, Proceedings of the twenty-sixth annual ACM symposium on Principles of distributed computing, August 12-15, 2007, Portland, Oregon, USA
|
|
|
Alejandro Cornejo , Sergio Rajsbaum , Michel Raynal , Corentin Travers, Failure detectors are schedulers, Proceedings of the twenty-sixth annual ACM symposium on Principles of distributed computing, August 12-15, 2007, Portland, Oregon, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Tallat M. Shafaat , Monika Moser , Thorsten Schütt , Alexander Reinefeld , Ali Ghodsi , Seif Haridi, Key-based consistency and availability in structured overlay networks, Proceedings of the 3rd international conference on Scalable information systems, June 04-06, 2008, Vico Equense, Italy
|
|
|
|
|
|
|
|
|
Fernando Castor Filho , Augusta Marques , Raphael Y. de Camargo , Fabio Kon, A group membership service for large-scale grids, Proceedings of the 6th international workshop on Middleware for grid computing, p.1-6, December 01-05, 2008, Leuven, Belgium
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Felix Hupfeld , Björn Kolbeck , Jan Stender , Mikael Högqvist , Toni Cortes , Jonathan Marti , Jesús Malo, FaTLease: scalable fault-tolerant lease negotiation with paxos, Proceedings of the 17th international symposium on High performance distributed computing, June 23-27, 2008, Boston, MA, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Gunjan Khanna , Mike Yu Cheng , Padma Varadharajan , Saurabh Bagchi , Miguel P. Correia , Paulo J. Veríssimo, Automated Rule-Based Diagnosis through a Distributed Monitor System, IEEE Transactions on Dependable and Secure Computing, v.4 n.4, p.266-279, October 2007
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Marcos K. Aguilera , Carole Delporte-Gallet , Hugues Fauconnier , Sam Toueg, Partial synchrony based on set timeliness, Proceedings of the 28th ACM symposium on Principles of distributed computing, August 10-12, 2009, Calgary, AB, Canada
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Felix Hupfeld , Björn Kolbeck , Jan Stender , Mikael Högqvist , Toni Cortes , Jonathan Martí , Jesús Malo, FaTLease: scalable fault-tolerant lease negotiation with Paxos, Cluster Computing, v.12 n.2, p.175-188, June 2009
|
INDEX TERMS
Primary Classification:
C.
Computer Systems Organization
C.2
COMPUTER-COMMUNICATION NETWORKS
Additional Classification:
C.
Computer Systems Organization
C.2
COMPUTER-COMMUNICATION NETWORKS
C.2.4
Distributed Systems
Subjects:
Network operating systems
C.4
PERFORMANCE OF SYSTEMS
Subjects:
Reliability, availability, and serviceability
D.
Software
D.4
OPERATING SYSTEMS
D.4.5
Reliability
Subjects:
Fault-tolerance
General Terms:
Algorithms,
Reliability,
Theory
Keywords:
Byzantine Generals' problem,
agreement problem,
asynchronous systems,
atomic broadcast,
commit problem,
consensus problem,
crash failures,
failure detection,
fault-tolerance,
message passing,
partial synchrony,
processor failures
REVIEW
"Christopher D. Carothers : Reviewer"
In distributed systems, the tractability of
computations has been a question of much interest and the subject of
much research. The consensus and atomic broadcast problems are of
particular interest. Previous work by Dolev et
al.
more...
|