On scalable and efficient distributed failure detectors
Source
Annual ACM Symposium on Principles of Distributed Computing
Proceedings of the twentieth annual ACM symposium on Principles of distributed computing
Newport, Rhode Island, United States
Pages: 170 - 179
Year of Publication: 2001
ISBN: 1-58113-383-9
ABSTRACT
Process groups in distributed applications and services rely on failure detectors to detect process failures completely, and as quickly, accurately, and scalably as possible, even in the face of unreliable message deliveries. In this paper, we quantify the optimal scalability of distributed, complete failure detectors, in terms of network load (in messages per second, with messages having a size limit), as a function of application-specified requirements. These requirements are 1) quick failure detection by some non-faulty process, and 2) accuracy of failure detection. We assume a crash-recovery (non-Byzantine) failure model, and a network model that is probabilistically unreliable (w.r.t. message deliveries and process failures). First, we characterize, under certain independence assumptions, the optimum worst-case network load imposed by any failure detector that achieves an application's requirements. We then discuss why traditional heartbeating schemes are inherently unscalable according to the optimal load. We also present a randomized, distributed failure detector algorithm that imposes an equal expected load per group member. This protocol satisfies the application-defined constraints of completeness and accuracy, and of speed of detection on average. It imposes a network load that differs from the optimal by a sub-optimality factor that is much lower than that for traditional distributed heartbeating schemes. Moreover, this sub-optimality factor does not vary with group size (for large groups).
CITED BY 16
Wei Xu, Jiannong Cao, Beihong Jin, Jing Li, Liang Zhang, GCS-MA: A group communication system for mobile agents, Journal of Network and Computer Applications, v.30 n.3, p.1153-1172, August 2007
Andrei Korostelev, Johan Lukkien, Jan Nesvadba, Yuechen Qian, QoS management in distributed service oriented systems, Proceedings of the 25th IASTED International Multi-Conference: parallel and distributed computing and networks, p.345-352, February 13-15, 2007, Innsbruck, Austria
Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, Werner Vogels, Dynamo: Amazon's highly available key-value store, ACM SIGOPS Operating Systems Review, v.41 n.6, December 2007
Naixue Xiong, Athanasios V. Vasilakos, Laurence T. Yang, Lingyang Song, Yi Pan, Rajgopal Kannan, Yingshu Li, Comparative analysis of quality of service and memory usage for adaptive failure detectors in healthcare systems, IEEE Journal on Selected Areas in Communications, v.27 n.4, p.495-509, May 2009