Rx: treating bugs as allergies---a safe method to survive software failures
Source: ACM SIGOPS Operating Systems Review
Volume 39, Issue 5 (December 2005)
SOSP '05
Session: Bugs
Pages: 235-248
Year of Publication: 2005
ISSN: 0163-5980
Authors
Feng Qin, University of Illinois at Urbana-Champaign
Joseph Tucek, University of Illinois at Urbana-Champaign
Jagadeesan Sundaresan, University of Illinois at Urbana-Champaign
Yuanyuan Zhou, University of Illinois at Urbana-Champaign
Publisher
ACM, New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 42,   Downloads (12 Months): 143,   Citation Count: 51
DOI: http://doi.acm.org/10.1145/1095809.1095833

ABSTRACT

Many applications demand availability. Unfortunately, software failures greatly reduce system availability. Prior work on surviving software failures suffers from one or more of the following limitations: required application restructuring, inability to address deterministic software bugs, unsafe speculation on program execution, and long recovery time.

This paper proposes an innovative safe technique, called Rx, which can quickly recover programs from many types of software bugs, both deterministic and non-deterministic. Our idea, inspired by allergy treatment in real life, is to roll back the program to a recent checkpoint upon a software failure, and then to re-execute the program in a modified environment. We base this idea on the observation that many bugs are correlated with the execution environment, and therefore can be avoided by removing the "allergen" from the environment. Rx requires few to no modifications to applications and provides programmers with additional feedback for bug diagnosis.

We have implemented Rx on Linux. Our experiments with four server applications that contain six bugs of various types show that Rx can survive all six software failures and provide transparent fast recovery within 0.017-0.16 seconds, 21-53 times faster than the whole program restart approach for all but one case (CVS). In contrast, the two tested alternatives, a whole program restart approach and a simple rollback and re-execution without environmental changes, cannot successfully recover the three servers (Squid, Apache, and CVS) that contain deterministic bugs, and have only a 40% recovery rate for the server (MySQL) that contains a non-deterministic concurrency bug. Additionally, Rx's checkpointing system is lightweight, imposing small time and space overheads.
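
The rollback-and-re-execute idea can be illustrated with a toy sketch in Python. This is not the Rx implementation, which works on real processes on Linux. It is a minimal in-memory analogy under stated assumptions: `run_with_recovery`, `buggy_step`, and the `padding` setting are hypothetical names invented for this example, a deep copy of a dictionary stands in for a process checkpoint, and padding a buffer is just one illustrative environment change.

```python
import copy


def run_with_recovery(step, state, request, environments):
    """Run step(state, request, env); on failure, roll back and retry.

    `environments` is tried in order. The first entry plays the role of
    the original environment, later entries are modified environments.
    """
    checkpoint = copy.deepcopy(state)  # stand-in for a process checkpoint
    for env in environments:
        try:
            return step(state, request, env), env
        except Exception:
            # Roll back: discard any partial effects of the failed attempt.
            state.clear()
            state.update(copy.deepcopy(checkpoint))
    raise RuntimeError("no environment change avoided the failure")


def buggy_step(state, request, env):
    """Hypothetical handler with a deterministic overflow bug."""
    state["handled"] += 1  # side effect that must be undone on failure
    buf = [0] * (8 + env.get("padding", 0))  # fixed-size buffer
    for i, value in enumerate(request):
        buf[i] = value  # raises IndexError when the request is too long
    return sum(buf)


state = {"handled": 0}
request = list(range(12))  # longer than the unpadded 8-slot buffer
environments = [{}, {"padding": 8}]  # original, then one modified environment

result, env_used = run_with_recovery(buggy_step, state, request, environments)
print(result, env_used, state)
```

In this sketch the first attempt fails every time it is repeated unchanged, which mirrors why plain re-execution cannot survive a deterministic bug. The retry succeeds only because the environment differs, and the rollback ensures the request is counted once rather than twice.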


