ABSTRACT
The initial design for a distributed, fault-tolerant version of UNIX based on three-way atomic message transmission was presented in an earlier paper [3]. The implementation effort then moved from Auragen Systems to Nixdorf Computer, where it was completed. This paper describes the working system, now known as the TARGON/32.
The original design left open questions in at least two areas: fault tolerance for server processes and recovery after a crash were only briefly and inaccurately sketched, and rebackup after recovery was not discussed at all. The fundamental design involving three-way message transmission has remained unchanged. However, in addition to important changes in the implementation, server backup has been redesigned and is now more consistent with the backup of normal user processes. Recovery and rebackup have been implemented in a less centralized, and thus more efficient, manner than previously envisioned.
In this paper, we review important aspects of the original design and note how the implementation differs from our original ideas. We then focus on backup and recovery for server processes, and on the changes and additions in the design and implementation of recovery and rebackup.
REFERENCES
[1] ARNOW, D., AND GLAZER, S. A fast safe file system for UNIX. Unpublished paper written in 1984 for Auragen Systems Corp., Ft. Lee, N.J.

[3] BORG, A., BAUMBACH, J., AND GLAZER, S. A message system supporting fault tolerance. In Proceedings of the Ninth ACM Symposium on Operating Systems Principles (Bretton Woods, N.H., Oct. 10-13, 1983), pp. 90-99.

[4] GRAY, J., MCJONES, P., BLASGEN, M., LINDSAY, B., LORIE, R., PRICE, T., PUTZOLU, F., AND TRAIGER, I. The recovery manager of the System R database manager. ACM Comput. Surv. 13, 2 (June 1981), 223-242. DOI 10.1145/356842.356847.

[7] LISKOV, B., AND LADIN, R. Highly-available distributed services and fault-tolerant distributed garbage collection. Programming Methodology Group Memo 48, MIT Laboratory for Computer Science, May 1986.

[10] RASHID, R., AND ROBERTSON, G. Accent: A communication-oriented network operating system kernel. Tech. Rep. CMU-CS-81-123, Dept. of Computer Science, Carnegie-Mellon Univ., Apr. 1981.

[12] STRATUS COMPUTERS, INC. Stratus/32, VOS Reference Manual. Marlborough, Mass., 1982.

[14] TOLERANT SYSTEMS, INC. Eternity series: Technology brief. Internal publication, Tolerant Systems, San Jose, Calif., July 1988.

[16] WALTER, B. A robust and efficient protocol for checking the availability of remote sites. In Proceedings of the Sixth Workshop on Distributed Data Management and Computer Networks (Berkeley, Calif., Feb. 1982), pp. 45-68.
CITED BY 49
M. Banâtre, Ph. Joubert, Ch. Morin, G. Muller, B. Rochat, P. Sanchez, Stable transactional memories and fault tolerant architectures, Proceedings of the 4th workshop on ACM SIGOPS European workshop, p.1-5, September 03-05, 1990, Bologna, Italy

Brendan Tangney, Vinny Cahill, Chris Horn, Dominic Herity, Alan Judge, Gradimir Starovic, Mark Sheppard, Some ideas on support for fault tolerance in COMANDOS, an object oriented distributed system, Proceedings of the 4th workshop on ACM SIGOPS European workshop, p.1-6, September 03-05, 1990, Bologna, Italy

Michael M. Swift, Muthukaruppan Annamalai, Brian N. Bershad, Henry M. Levy, Recovering device drivers, Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation, p.1-1, December 06-08, 2004, San Francisco, CA

Sudarshan M. Srinivasan, Srikanth Kandula, Christopher R. Andrews, Yuanyuan Zhou, Flashback: a lightweight extension for rollback and deterministic replay for software debugging, Proceedings of the USENIX Annual Technical Conference 2004 on USENIX Annual Technical Conference, p.3-3, June 27-July 02, 2004, Boston, MA

Joseph Tucek, James Newsome, Shan Lu, Chengdu Huang, Spiros Xanthos, David Brumley, Yuanyuan Zhou, Dawn Song, Sweeper: a lightweight end-to-end system for defending against fast worms, ACM SIGOPS Operating Systems Review, v.41 n.3, June 2007

Brendan Tangney, Vinny Cahill, Chris Horn, Dominic Herity, Alan Judge, Gradimir Starovic, Mark Sheppard, Some ideas on support for fault tolerance in COMANDOS, an object oriented distributed system, ACM SIGOPS Operating Systems Review, v.25 n.2, p.130-135, April 1991

M. Banâtre, Ph. Joubert, Ch. Morin, G. Muller, B. Rochat, P. Sanchez, Stable transactional memories and fault tolerant architectures, ACM SIGOPS Operating Systems Review, v.25 n.1, p.68-72, Jan. 1991

Oreste Villa, Sriram Krishnamoorthy, Jarek Nieplocha, David M. Brown, Jr., Scalable transparent checkpoint-restart of global address space applications on virtual machines over infiniband, Proceedings of the 6th ACM conference on Computing frontiers, May 18-20, 2009, Ischia, Italy

REVIEW
Reviewer: Paul Siegel
After many years of relatively quiet use in Bell Labs, universities, and a few commercial development centers, the UNIX operating system has recently become more popular. Its portability is a major asset: applications developed on a machine run ...