ABSTRACT
Many applications demand availability. Unfortunately, software failures greatly reduce system availability. Prior work on surviving software failures suffers from one or more of the following limitations: required application restructuring, inability to address deterministic software bugs, unsafe speculation on program execution, and long recovery time.

This paper proposes an innovative safe technique, called Rx, which can quickly recover programs from many types of software bugs, both deterministic and non-deterministic. Our idea, inspired by allergy treatment in real life, is to roll back the program to a recent checkpoint upon a software failure, and then to re-execute the program in a modified environment. We base this idea on the observation that many bugs are correlated with the execution environment, and therefore can be avoided by removing the "allergen" from the environment. Rx requires few to no modifications to applications and provides programmers with additional feedback for bug diagnosis.

We have implemented Rx on Linux. Our experiments with four server applications that contain six bugs of various types show that Rx can survive all six software failures and provide transparent, fast recovery within 0.017-0.16 seconds, 21-53 times faster than the whole program restart approach for all but one case (CVS). In contrast, the two tested alternatives, a whole program restart approach and a simple rollback and re-execution without environmental changes, cannot successfully recover the three servers (Squid, Apache, and CVS) that contain deterministic bugs, and have only a 40% recovery rate for the server (MySQL) that contains a non-deterministic concurrency bug. Additionally, Rx's checkpointing system is lightweight, imposing small time and space overheads.
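The rollback and re-execution idea described above can be illustrated with a toy sketch. This is not the Rx implementation: the names `SimulatedCrash`, `ENVIRONMENTS`, `handle_request`, and `serve_with_recovery` are hypothetical, the "environment change" is reduced to a single assumed knob (padding added to a buffer), and the checkpoint is a plain in-memory copy of a dictionary.

```python
import copy


class SimulatedCrash(Exception):
    """Stands in for a software failure detected at run time."""


# Hypothetical environment settings, tried in order. The first entry is the
# unmodified environment; later entries pad the buffer more generously.
ENVIRONMENTS = [
    {"pad_bytes": 0},
    {"pad_bytes": 16},
    {"pad_bytes": 64},
]


def handle_request(state, request, env):
    """Toy handler with a deterministic bug: a fixed 8-byte buffer."""
    # Partial side effect that happens before the bug is triggered.
    state["log"].append("begin " + request[:4])
    capacity = 8 + env["pad_bytes"]
    if len(request) > capacity:
        raise SimulatedCrash("buffer overflow")
    state["served"] += 1


def serve_with_recovery(state, request):
    """Checkpoint, run, and on failure roll back and retry in a changed environment.

    Returns the environment that succeeded, or None if every attempt failed.
    """
    checkpoint = copy.deepcopy(state)
    for env in ENVIRONMENTS:
        try:
            handle_request(state, request, env)
            return env
        except SimulatedCrash:
            # Roll back: discard all side effects of the failed attempt.
            state.clear()
            state.update(copy.deepcopy(checkpoint))
    return None
```

In this sketch, re-executing in the unchanged environment would fail every time because the bug is deterministic; only the modified environment lets the request complete, and the rollback keeps the failed attempt's partial log entry from leaking into the recovered state.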