ACM Home Page
Please provide us with feedback. Feedback
On scalable and efficient distributed failure detectors
Full text PdfPdf (823 KB)
Source Annual ACM Symposium on Principles of Distributed Computing archive
Proceedings of the twentieth annual ACM symposium on Principles of distributed computing table of contents
Newport, Rhode Island, United States
Pages: 170 - 179  
Year of Publication: 2001
ISBN:1-58113-383-9
Authors
Indranil Gupta  Cornell Univ., Ithaca, NY
Tushar D. Chandra  IBM T.J. watson Research Center, Yorktown Heights, NY
Germán S. Goldszmidt  IBM T.J. watson Research Center, Yorktown Heights, NY0
Sponsors
SIGACT: ACM Special Interest Group on Algorithms and Computation Theory
SIGOPS: ACM Special Interest Group on Operating Systems
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 12,   Downloads (12 Months): 85,   Citation Count: 16
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/383962.384010
What is a DOI?

ABSTRACT

Process groups in distributed applications and services rely on failure detectors to detect process failures completely, and as quickly, accurately, and scalably as possible, even in the face of unreliable message deliveries. In this paper, we look at quantifying the optimal scalability, in terms of network load, (in messages per second, with messages having a size limit) of distributed, complete failure detectors as a function of application-specified requirements. These requirements are 1) quick failure detection by some non-faulty process, and 2) accuracy of failure detection. We assume a crash-recovery (non-Byzantine) failure model, and a network model that is probabilistically unreliable (w.r.t. message deliveries and process failures). First, we characterize, under certain independence assumptions, the optimum worst-case network load imposed by any failure detector that achieves an application's requirements. We then discuss why traditional heart beating schemes are inherently unscalable according to the optimal load. We also present a randomized, distributed, failure detector algorithm that imposes an equal expected load per group member. This protocol satisfies the application defined constraints of completeness and accuracy, and speed of detection on an average. It imposes a network load that differs frown the optimal by a sub-optimality factor that is much lower than that for traditional distributed heartbeating schemes. Moreover, this sub-optimality factor does not vary with group size (for large groups).


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
 
2
C. Almeida and P. Verissimo. Timing failure detection and real-time group communication in real-time systems. In Proceedings of 8th Euromicro Workshop on Real-Time Systems, June 1996.
3
 
4
5
 
6
 
7
S. A. Fakhouri, G. S. Goldszmidt, I. Gupta, M. Kalantar, and J. A. Pershing. Guffstream - a system for dynamic topology management in multi-domain server farms. Technical Report RC 21954, IBM T.J. Watson Research Center, February 2001.
8
9
 
10
 
11
J. M. Helary and M. Hurfin. Solving Agreement problems with failure detectors; a survey. Annals of Telecommunications, 52(9-10):447-464, September-October 1997.
12
 
13
 
14
R. van Renesse, Y. Minsky, and M. Hayden. A gossip-style failure detection service. In Proceedings of International Conference and Distributed Systems Platforms and Open Distributed Processing (IFIP), 1998.

CITED BY  16

Collaborative Colleagues:
Indranil Gupta: colleagues
Tushar D. Chandra: colleagues
Germán S. Goldszmidt: colleagues