ABSTRACT
Recently there has been renewed interest in building reliable servers that support continuous application operation. Besides keeping system state consistent after a failure, one of the main challenges in achieving continuous operation is fast reconfiguration. The complexity and overhead of the failure reconfiguration mechanisms depend on the type of platform used as a server and the types of applications that need to be supported. In this paper we focus on providing support for shared-memory applications running on clusters of commodity nodes and interconnects. Achieving continuous operation for shared-memory applications on clusters presents two main challenges: (a) the fault tolerance mechanisms employed should be transparent to applications and should have low overhead during failure-free execution; (b) when failures occur, reconfiguration should proceed with minimum application disruption and without requiring full recovery of the failed node.

In this work we examine in detail the latter, i.e., (b), the failure reconfiguration path. We use a previously developed system [8] that achieves (a) by dynamically replicating data to the memories of multiple nodes during execution. We examine how the runtime system can minimize application interruption when failures occur. We present the design and implementation of FineFRC (Fine-grained Failure Reconfiguration on Clusters), a runtime system for achieving continuous operation of shared-memory applications on commodity clusters without requiring application instrumentation or human intervention. We present results from a working 16-processor system that achieves sub-second failure reconfiguration times.
REFERENCES
[2] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, Andrew Warfield. Xen and the art of virtualization. Proceedings of the nineteenth ACM symposium on Operating systems principles, October 19-22, 2003, Bolton Landing, NY, USA.
[3] J. Bartlett, W. Bartlett, R. Carr, D. Garcia, J. G. R. Horst, R. Jardine, D. Lenoski, and D. McGuire. Fault tolerance in Tandem computer systems. Technical Report TR-90.5, Tandem, 1990.
[5] A. Bilas, C. Liao, and J. P. Singh. Accelerating shared virtual memory using commodity NI support to avoid asynchronous message handling. In The 26th Int'l Symposium on Computer Architecture, May 1999.
[6] Nanette J. Boden, Danny Cohen, Robert E. Felderman, Alan E. Kulawik, Charles L. Seitz, Jakov N. Seizovic, Wen-King Su. Myrinet: A Gigabit-per-Second Local Area Network. IEEE Micro, v.15 n.1, p.29-36, February 1995. doi:10.1109/40.342015
[7] Peter M. Chen, Wee Teck Ng, Subhachandra Chandra, Christopher Aycock, Gurushankar Rajamani, David Lowell. The Rio file cache: surviving operating system crashes. Proceedings of the seventh international conference on Architectural support for programming languages and operating systems, p.74-83, October 01-04, 1996, Cambridge, Massachusetts, United States.
[9] V. S. Corp. Veritas FirstWatch. http://www.veritas.com.
[10] Manuel Costa, Paulo Guedes, Manuel Sequeira, Nuno Neves, Miguel Castro. Lightweight logging for lazy release consistent distributed shared memory. Proceedings of the second USENIX symposium on Operating systems design and implementation, p.59-73, October 29-November 01, 1996, Seattle, Washington, United States.
[11] C. Dubnicki, A. Bilas, Y. Chen, S. Damianakis, and K. Li. VMMC-2: Efficient support for reliable, connection-oriented communication. In Proc. of the Hot Interconnects Symposium V, Aug. 1997.
[13] IBM. High availability with DB2 UDB and SteelEye LifeKeeper. IBM Center for Advanced Studies Conference (CASCON): Technology Showcase, Toronto, Canada, Oct 2003.
[14] Dongming Jiang, Hongzhang Shan, Jaswinder Pal Singh. Application restructuring and performance portability on shared virtual memory and hardware-coherent multiprocessors. Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming, p.217-229, June 18-21, 1997, Las Vegas, Nevada, United States.
[16] P. Keleher, S. Dwarkadas, A. L. Cox, and W. Zwaenepoel. TreadMarks: Distributed shared memory on standard workstations and operating systems. In Proceedings of the 1994 Winter Usenix Conference, pages 115-131, Jan. 1994.
[18] J. Kim and N. Vaidya. Analysis of failure recovery schemes for distributed shared-memory systems. IEEE Computers and Digital Techniques, 146(3), May 1999.
[19] K. Li. Ivy: A shared virtual memory system for parallel computing. Proceedings of the 1988 International Conference on Parallel Processing, 2:94-101, August 1988.
[22] NCR LifeKeeper. http://www.ncr.com.
[25] Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, David A. Wood. SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery. Proceedings of the 29th annual international symposium on Computer architecture, p.123, May 25-29, 2002, Anchorage, Alaska.
[26] M. Stumm and S. Zhou. Fault tolerant distributed shared memory algorithms. In Proc. of the 2nd IEEE Symposium on Parallel and Distributed Processing, pages 719-724, December 1990.
[27] Florin Sultan, Liviu Iftode, Thu Nguyen. Scalable fault-tolerant distributed shared memory. Proceedings of the 2000 ACM/IEEE conference on Supercomputing (CDROM), p.20-es, November 04-10, 2000, Dallas, Texas, United States.
[28] VMware. VMware ESX Server Storage Area Networks. http://www.vmware.com/, 2003.
[29] W. Vogels, D. Dumitriu, K. Birman, R. Gamache, M. Massa, R. Short, J. Vert, J. Barrera, J. Gray. The Design and Architecture of the Microsoft Cluster Service - A Practical Approach to High-Availability and Scalability. Proceedings of the Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing, p.422, June 23-25, 1998.
[30] S. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. Methodological considerations and characterization of the SPLASH-2 parallel application suite. In Proceedings of the 23rd Int'l Symposium on Computer Architecture, May 1995.
[32] Yuanyuan Zhou, Liviu Iftode, Kai Li. Performance evaluation of two home-based lazy release consistency protocols for shared virtual memory systems. Proceedings of the second USENIX symposium on Operating systems design and implementation, p.75-88, October 29-November 01, 1996, Seattle, Washington, United States.
[33] Transaction Processing Performance Council. TPC Benchmark B Standard Specification, August 1990.
[34] Transaction Processing Performance Council. TPC Benchmark C Standard Specification, August 1996.