Improving cluster availability using workstation validation
Source: Joint International Conference on Measurement and Modeling of Computer Systems
Proceedings of the 2002 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems
Marina Del Rey, California
SESSION: Distributed systems
Pages: 217 - 227  
Year of Publication: 2002
ISBN:1-58113-531-9
Authors
Taliver Heath  Rutgers University, Piscataway, NJ
Richard P. Martin  Rutgers University, Piscataway, NJ
Thu D. Nguyen  Rutgers University, Piscataway, NJ
Sponsor
SIGMETRICS: ACM Special Interest Group on Measurement and Evaluation
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/511334.511362

ABSTRACT

We demonstrate a framework for improving the availability of cluster-based Internet services. Our approach models Internet services as a collection of interconnected components, each possessing well-defined interfaces and failure semantics. Such a decomposition allows designers to engineer high availability based on an understanding of the interconnections and isolated fault behavior of each component, as opposed to ad-hoc methods. In this work, we focus on using the entire commodity workstation as a component because it possesses natural, fault-isolated interfaces. We define a failure event as a reboot, both because a workstation is unavailable during a reboot and because reboots are symptomatic of a larger class of failures, such as configuration and operator errors. Our observations of three distinct clusters show that the time between reboots is best modeled by a Weibull distribution with a shape parameter of less than 1, implying that a workstation becomes more reliable the longer it has been operating. Leveraging this observed property, we design an allocation strategy that withholds recently rebooted workstations from active service, validating their stability before allowing them to return to service. We show via simulation that this policy leads to a 70-30 rule of thumb: for a constant utilization, approximately 70% of workstation failures can be masked from end clients with 30% extra capacity added to the cluster, provided reboots are not strongly correlated. We also find that our technique is more sensitive to the burstiness of reboots than to the absolute lengths of workstation uptimes.
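The link between a Weibull shape parameter below 1 and the validation policy can be illustrated with a small sketch. This is not the authors' simulator and does not reproduce the 70-30 result. It only shows that, for a shape below 1, the hazard rate falls with uptime, and it estimates what fraction of reboots would land inside a validation window. The shape, scale, and validation period below are hypothetical values chosen for illustration, and reboots are assumed to be independent.

```python
import math
import random


def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate h(t) = (k / lam) * (t / lam)^(k - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def conditional_survival(t, dt, shape, scale):
    """P(uptime > t + dt | uptime > t) under a Weibull uptime model."""
    def survival(x):
        return math.exp(-((x / scale) ** shape))
    return survival(t + dt) / survival(t)


def masked_fraction(uptimes, validation_period):
    """Fraction of reboots that happen while a workstation is still being
    validated, i.e. before it would have returned to active service."""
    masked = sum(1 for u in uptimes if u <= validation_period)
    return masked / len(uptimes)


# Hypothetical parameters (not taken from the paper): shape < 1, arbitrary time units.
SHAPE, SCALE, VALIDATION = 0.5, 100.0, 10.0

rng = random.Random(0)
# random.weibullvariate takes (scale, shape) in that order.
uptimes = [rng.weibullvariate(SCALE, SHAPE) for _ in range(10_000)]

early = weibull_hazard(1.0, SHAPE, SCALE)
late = weibull_hazard(50.0, SHAPE, SCALE)
fraction = masked_fraction(uptimes, VALIDATION)

print(f"hazard at t=1: {early:.4f}, hazard at t=50: {late:.4f}")
print(f"fraction of reboots inside the validation window: {fraction:.3f}")
```

Under these assumptions, a machine that has just rebooted is more likely to fail in the next interval than one that has already survived a while, which is why holding it back for a validation period can keep some failures away from clients. The extra capacity needed to hold machines back, and the effect of correlated reboots, are outside the scope of this sketch.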


