ABSTRACT
We demonstrate a framework for improving the availability of cluster-based Internet services. Our approach models an Internet service as a collection of interconnected components, each with well-defined interfaces and failure semantics. This decomposition lets designers engineer high availability from an understanding of the interconnections and the isolated fault behavior of each component, rather than by ad hoc methods. In this work, we treat the entire commodity workstation as a component because it has natural, fault-isolated interfaces. We define a failure event as a reboot for two reasons: a workstation is unavailable while it reboots, and reboots are symptomatic of a larger class of failures, such as configuration and operator errors. Our observations of three distinct clusters show that the time between reboots is best modeled by a Weibull distribution with shape parameters less than 1, implying that a workstation becomes more reliable the longer it has been operating. Leveraging this property, we design an allocation strategy that withholds recently rebooted workstations from active service and validates their stability before returning them to service. We show via simulation that this policy leads to a 70-30 rule of thumb: at constant utilization, approximately 70% of workstation failures can be masked from end clients by adding 30% extra capacity to the cluster, provided reboots are not strongly correlated. We also find that the technique is more sensitive to the burstiness of reboots than to the absolute lengths of workstation uptimes.
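The link between a Weibull shape parameter below 1 and a workstation becoming more reliable with age can be checked numerically. The following is a minimal sketch, not the paper's simulator: the function names and the parameter values (shape 0.5, scale 100 hours, a 24-hour horizon) are illustrative assumptions and are not taken from the measured clusters. It evaluates the standard Weibull hazard rate and the probability that a machine already up for a given time survives a further fixed horizon.

```python
import math


def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate h(t) = (k/lam) * (t/lam)**(k-1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def weibull_survival(t, shape, scale):
    """P(uptime > t) = exp(-(t/lam)**k), for t >= 0."""
    return math.exp(-((t / scale) ** shape))


def conditional_survival(age, horizon, shape, scale):
    """P(uptime > age + horizon | uptime > age)."""
    return (weibull_survival(age + horizon, shape, scale)
            / weibull_survival(age, shape, scale))


# Hypothetical parameters chosen only for illustration (time in hours).
SHAPE, SCALE, HORIZON = 0.5, 100.0, 24.0

if __name__ == "__main__":
    for age in (1.0, 10.0, 100.0):
        h = weibull_hazard(age, SHAPE, SCALE)
        p = conditional_survival(age, HORIZON, SHAPE, SCALE)
        print(f"age={age:6.1f}h  hazard={h:.4f}/h  P(survive next 24h)={p:.3f}")
```

With a shape below 1 the hazard falls as uptime grows, so a machine that has already run for a while is more likely to survive the next interval than one that has just rebooted. This is the property that motivates holding recently rebooted workstations out of service for a probation period. With a shape of exactly 1 the hazard is constant and such a policy would gain nothing.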