ACM Home Page
Please provide us with feedback. Feedback
A survey of rollback-recovery protocols in message-passing systems
Full text PdfPdf (550 KB)
Source ACM Computing Surveys (CSUR) archive
Volume 34 ,  Issue 3  (September 2002) table of contents
Pages: 375 - 408  
Year of Publication: 2002
ISSN:0360-0300
Authors
E. N. (Mootaz) Elnozahy  IBM Research, Austin, TX
Lorenzo Alvisi  The University of Texas at Austin, Austin, TX
Yi-Min Wang  Microsoft Research, Redmond, WA
David B. Johnson  Rice University, Houston, TX
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 66,   Downloads (12 Months): 563,   Citation Count: 100
Additional Information:

abstract   references   cited by   index terms   review   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/568522.568525
What is a DOI?

ABSTRACT

This survey covers rollback-recovery techniques that do not require special language constructs. In the first part of the survey we classify rollback-recovery protocols into checkpoint-based and log-based. Checkpoint-based protocols rely solely on checkpointing for system state restoration. Checkpointing can be coordinated, uncoordinated, or communication-induced. Log-based protocols combine checkpointing with logging of nondeterministic events, encoded in tuples called determinants. Depending on how determinants are logged, log-based protocols can be pessimistic, optimistic, or causal. Throughout the survey, we highlight the research issues that are at the core of rollback-recovery and present the solutions that currently address them. We also compare the performance of different rollback-recovery protocols with respect to a series of desirable properties and discuss the issues that arise in the practical implementations of these protocols.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
 
2
 
3
 
4
Appel, A. W. 1989. A runtime system. Technical Report CS-TR220-89, Department of Computer Science, Princeton University.
5
 
6
 
7
8
 
9
 
10
Bhargava, B. and Lian, S. R. 1988. Independent checkpointing and concurrent rollback for recovery---An optimistic approach. In Proceedings, Seventh Symposium on Reliable Distributed Systems, 3--12.
 
11
12
13
 
14
Briatico, D., Ciuffoletti, A., and Simoncini, L. 1984. A distributed domino-effect free recovery algorithm. In IEEE International Symposium on Reliability, Distributed Software, and Databases, 207--215.
 
15
Chandy, M. and Ramamoorthy, C. V. 1972. Rollback and recovery strategies for computer programs. IEEE Trans. Comput. 21, 6, 546--556.
16
 
17
Cristian, F. and Jahanian, F. 1991. A timestamp-based checkpointing protocol for long-lived distributed computations. In Proceedings, Tenth Symposium on Reliable Distributed Systems, 12--20.
 
18
 
19
 
20
Elnozahy, E. N. and Zwaenepoel, W. 1994. On the use and implementing of message logging. In Digest of Papers, FTCS-24, The Twenty Fourth International Symposium on Fault-Tolerant Computing, 298--307.
 
21
Elnozahy, E. N., Johnson, D. B., and Zwaenepoel, W. 1992. The performance of consistent checkpointing. In Proceedings, Eleventh Symposium on Reliable Distributed Systems, 39--47.
22
 
23
Goldberg, A., Gopal, A., Li, K., Strom, R., and Bacon, D. 1990. Transparent recovery of Mach applications. In Usenix Mach Workshop Proceedings, 169--184.
 
24
 
25
 
26
Huang, Y. and Kintala, C. 1993. Software implemented fault tolerance: Technologies and experience. In Digest of Papers, FTCS-23, the Twenty Third Annual International Symposium on Fault-Tolerant Computing, 2--9.
 
27
 
28
 
29
Johnson, D. B. and Zwaenepoel, W. 1987. Sender-based message logging. In Digest of Papers, FTCS-17, The Seventeenth Annual International Symposium on Fault-Tolerant Computing, 14--19.
 
30
 
31
Juang, T. T.-Y. and Venkatesan, S. 1991. Crash recovery with little overhead. In Proceedings, The 11th International Conference on Distributed Computing Systems, 454--461.
 
32
 
33
34
 
35
Lampson, B. W. and Sturgis, H. E. 1979. Crash recovery in a distributed data storage system. Technical Report, Xerox Palo Alto Research Center.
 
36
Li, C. C. and Fuchs, W. K. 1990. CATCH: Compiler-assisted techniques for checkpointing. In Digest of Papers, FTCS-20, The Twentieth Annual International Symposium on Fault-Tolerant Computing, 74--81.
37
 
38
 
39
 
40
 
41
 
42
 
43
 
44
Plank, J. S. and Li, K. 1994. Faster checkpointing with N + 1 parity. In Digest of Papers, FTCS-24, The Twenty Fourth Annual International Symposium on Fault-Tolerant Computing, 288--297.
 
45
Plank, J. S., Xu, J., and Netzer, R. H. 1995a. Compressed differences: An algorithm for fast incremental checkpointing. Technical Report CS-95-302, University of Tennessee at Knoxville.
 
46
Plank, J. S., Beck, M., Kingsley, G., and Li, K. 1995b. Libckpt: Transparent checkpointing under UNIX. In Proceedings of the USENIX Winter 1995 Technical Conference, 213--223.
 
47
 
48
Randell, B. 1975. System structure for software fault tolerance. IEEE Trans. Softw. Engin. 1, 2, 220--232.
 
49
 
50
Ruffin, M. 1992. KITLOG: A generic logging service. In Proceedings, Eleventh Symposium on Reliable Distributed Systems, 139--148.
 
51
Russell, D. L. 1980. State restoration in systems of communicating processes. IEEE Trans. Softw. Engin. 6, 2, 183--194.
52
 
53
Silva, L. M. 1997. Checkpointing Mechanisms for Scientific Parallel Applications. Ph.D. Thesis, University of Coimbra, Department of Computer Science.
54
 
55
 
56
57
 
58
Tamir, Y. and Sequin, C. H. 1984. Error recovery in multicomputers using global checkpoints. In Proceedings of the International Conference on Parallel Processing, 32--41.
 
59
 
60
 
61
 
62
Wang, Y.-M., Chung, P. Y., and Fuchs, W. K. 1995a. Tight upper bound on useful distributed system checkpoints. Technical Report, University of Illinois.
 
63

CITED BY  100


REVIEW

"Bayard Kohlhepp : Reviewer"

Computer applications now span the globe, and incorporate devices ranging in size and power from watches to clustered supercomputers. The further a system reaches, and the more its heterogeneity decreases, the more fragile (susceptible to exceptio  more...

Collaborative Colleagues:
E. N. (Mootaz) Elnozahy: colleagues
Lorenzo Alvisi: colleagues
Yi-Min Wang: colleagues
David B. Johnson: colleagues