ABSTRACT
This survey covers rollback-recovery techniques that do not require special language constructs. In the first part of the survey we classify rollback-recovery protocols into checkpoint-based and log-based. Checkpoint-based protocols rely solely on checkpointing for system state restoration. Checkpointing can be coordinated, uncoordinated, or communication-induced. Log-based protocols combine checkpointing with logging of nondeterministic events, encoded in tuples called determinants. Depending on how determinants are logged, log-based protocols can be pessimistic, optimistic, or causal. Throughout the survey, we highlight the research issues that are at the core of rollback-recovery and present the solutions that currently address them. We also compare the performance of different rollback-recovery protocols with respect to a series of desirable properties and discuss the issues that arise in the practical implementations of these protocols.
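The log-based idea summarized above (restore a checkpoint, then replay logged determinants) can be illustrated with a minimal single-process sketch. It is not taken from the survey: the `Determinant` fields, the `Process` class, and the in-memory list standing in for stable storage are all assumptions chosen for illustration, and the logging discipline shown (log before applying) is only the pessimistic variant.

```python
from dataclasses import dataclass, field
from typing import NamedTuple


class Determinant(NamedTuple):
    """Hypothetical determinant: enough information to replay one delivery."""
    seq: int       # position of the delivery in the receiver's order
    sender: str
    payload: int


@dataclass
class Process:
    total: int = 0          # application state (order-sensitive, see deliver)
    delivered: int = 0      # number of messages delivered so far
    # Both fields below stand in for stable storage (an assumption).
    checkpoint: dict = field(
        default_factory=lambda: {"total": 0, "delivered": 0})
    log: list = field(default_factory=list)

    def deliver(self, sender: str, payload: int, record: bool = True) -> None:
        if record:
            # Pessimistic flavor: log the determinant before applying it.
            self.log.append(Determinant(self.delivered, sender, payload))
        # The result depends on delivery order, so replay must preserve it.
        self.total = self.total * 2 + payload
        self.delivered += 1

    def take_checkpoint(self) -> None:
        self.checkpoint = {"total": self.total, "delivered": self.delivered}
        # Determinants that precede the checkpoint are no longer needed.
        self.log = [d for d in self.log if d.seq >= self.delivered]

    def recover(self) -> None:
        # Roll back to the checkpoint, then replay logged deliveries in order.
        self.total = self.checkpoint["total"]
        self.delivered = self.checkpoint["delivered"]
        for d in sorted(self.log):
            self.deliver(d.sender, d.payload, record=False)
```

A checkpoint-only protocol would stop after restoring `checkpoint`; replaying the log is what lets a log-based protocol bring the process back to its state at the moment of failure.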
REVIEW
Reviewer: Bayard Kohlhepp
Computer applications now span the globe, and incorporate devices ranging in size and power from watches to clustered supercomputers. The further a system reaches, and the more its heterogeneity decreases, the more fragile (susceptible to exceptio