ABSTRACT
Cluster-based servers can substantially increase performance when nodes cooperate to globally manage resources. However, in this paper we show that, in the absence of high-availability mechanisms, cooperation results in a substantial loss of availability. Specifically, we show that a sophisticated cluster-based Web server, which gains a factor of 3 in performance through cooperation, increases service unavailability by a factor of 10 over a non-cooperative version. We then show how to augment this Web server with software components embodying a small set of high-availability techniques to regain the lost availability. Among other observations, we show that applying multiple high-availability techniques, each implemented independently in its own subsystem, can lead to inconsistent recovery actions. We also show that a novel technique called Fault Model Enforcement can be used to resolve such inconsistencies. Augmenting the server with these techniques led to a final expected availability of close to 99.99%.