ABSTRACT
Optimistic recovery is a new technique supporting application-independent, transparent recovery from processor failures in distributed systems. In optimistic recovery, communication, computation, and checkpointing proceed asynchronously. Synchronization is replaced by causal dependency tracking, which enables a posteriori reconstruction of a consistent distributed system state following a failure, using process rollback and message replay.
Because there is no synchronization among computation, communication, and checkpointing, optimistic recovery can tolerate the failure of an arbitrary number of processors and yields better throughput and response time than other general recovery techniques whenever failures are infrequent.
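The causal dependency tracking mentioned above can be illustrated with a minimal sketch. It is not the paper's implementation: the `Process` class, its method names, and the `last_recoverable` parameter are hypothetical, and the sketch assumes that every received message starts a new state interval, that each message carries the sender's dependency vector, and that checkpointing, logging, and message replay are handled elsewhere.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Hypothetical process that records causal dependencies per state interval."""
    pid: int
    n: int  # total number of processes (assumed fixed)
    interval: int = 0
    # history[i] is the dependency vector of state interval i
    history: list = field(default_factory=list)

    def __post_init__(self):
        # Interval 0 depends on nothing.
        self.history.append([0] * self.n)

    @property
    def dv(self):
        return self.history[-1]

    def send(self, payload):
        # Piggyback a copy of the current dependency vector on the message.
        return {"payload": payload, "dv": list(self.dv)}

    def receive(self, msg):
        # A received message starts a new state interval whose vector is the
        # element-wise maximum of the local vector and the sender's vector.
        self.interval += 1
        merged = [max(a, b) for a, b in zip(self.dv, msg["dv"])]
        merged[self.pid] = self.interval
        self.history.append(merged)

    def rollback_if_orphan(self, failed_pid, last_recoverable):
        # Discard every interval that depends on a lost interval of the
        # failed process, i.e. one numbered above last_recoverable.
        while len(self.history) > 1 and self.history[-1][failed_pid] > last_recoverable:
            self.history.pop()
        self.interval = len(self.history) - 1
        return self.interval


p0, p1, p2 = (Process(pid=i, n=3) for i in range(3))
external = {"payload": "input", "dv": [0, 0, 0]}

p0.receive(external)          # p0 enters interval 1
p1.receive(p0.send("m1"))     # p1 now depends on p0's interval 1
p0.receive(external)          # p0 enters interval 2
p2.receive(p0.send("m2"))     # p2 now depends on p0's interval 2

# Assume p0 fails and only its interval 1 can be reconstructed.
print(p1.rollback_if_orphan(failed_pid=0, last_recoverable=1))
print(p2.rollback_if_orphan(failed_pid=0, last_recoverable=1))
```

In this scenario `p1` keeps its state because it depends only on a recoverable interval of `p0`, while `p2` is rolled back because it received a message sent from an interval that was lost. No process waited for another during normal operation; consistency is restored only after the failure.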
REVIEW
Reviewer: Brent T. Hailpern
This paper discusses a processor-failure recovery technique for a system of
communicating processes. The technique is implemented as a component of the
process-scheduling and network-interface portion of the operating system. By
being part of th