Optimistic recovery in distributed systems
Source: ACM Transactions on Computer Systems (TOCS)
Volume 3, Issue 3 (August 1985)
Pages: 204-226
Year of Publication: 1985
ISSN:0734-2071
Authors
Rob Strom  IBM Thomas J. Watson Research Center, Yorktown Heights, NY
Shaula Yemini  IBM Thomas J. Watson Research Center, Yorktown Heights, NY
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 10,   Downloads (12 Months): 104,   Citation Count: 125
DOI: http://doi.acm.org/10.1145/3959.3962

ABSTRACT

Optimistic Recovery is a new technique supporting application-independent transparent recovery from processor failures in distributed systems. In optimistic recovery, communication, computation, and checkpointing proceed asynchronously. Synchronization is replaced by causal dependency tracking, which enables a posteriori reconstruction of a consistent distributed system state following a failure using process rollback and message replay.

Because there is no synchronization among computation, communication, and checkpointing, optimistic recovery can tolerate the failure of an arbitrary number of processors and yields better throughput and response time than other general recovery techniques whenever failures are infrequent.
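
The idea of causal dependency tracking can be illustrated with a small toy model. The sketch below is not the algorithm from the paper. It assumes each process numbers its "state intervals" (a new interval starts with every received message) and keeps a dependency vector that rides along on outgoing messages. The names `Process`, `deps`, and `orphans` are hypothetical, and checkpointing, message logging, and replay are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Toy process that tracks causal dependencies (hypothetical model)."""
    pid: str
    interval: int = 0                          # index of the current state interval
    deps: dict = field(default_factory=dict)   # pid -> highest interval depended on

    def __post_init__(self):
        # A process always depends on its own current interval.
        self.deps[self.pid] = self.interval

    def send(self):
        # A message carries a copy of the sender's dependency vector.
        return dict(self.deps)

    def receive(self, msg_deps):
        # Assumption: every received message starts a new state interval.
        self.interval += 1
        # Merge the sender's dependencies by taking the element-wise maximum.
        for pid, idx in msg_deps.items():
            self.deps[pid] = max(self.deps.get(pid, -1), idx)
        self.deps[self.pid] = self.interval


def orphans(processes, failed_pid, recovered_interval):
    """Return pids whose state depends on an interval the failed process lost."""
    return sorted(
        p.pid
        for p in processes
        if p.pid != failed_pid
        and p.deps.get(failed_pid, -1) > recovered_interval
    )


a, b, c, d = (Process(name) for name in "abcd")
a.receive({})          # external input starts interval 1 at a
b.receive(a.send())    # b now depends on a's interval 1
c.receive(b.send())    # c depends on a transitively, through b

# If a fails and can only be restored to interval 0, b and c must roll back;
# d never heard from a, so it is unaffected.
print(orphans([a, b, c, d], "a", 0))
```

In this model no process waits for another during normal operation. The cost is paid only after a failure, when the dependency vectors identify which processes have to roll back.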



REVIEW

Reviewer: Brent T. Hailpern

This paper discusses a processor-failure recovery technique for a system of communicating processes. The technique is implemented as a component of the process-scheduling and network-interface portion of the operating system. By being part of th...
