Failure-aware checkpointing in fine-grained cycle sharing systems
Source
High Performance Distributed Computing archive
Proceedings of the 16th International Symposium on High Performance Distributed Computing
Monterey, California, USA
SESSION: Reliability and fault tolerance
Pages: 33 - 42  
Year of Publication: 2007
ISBN:978-1-59593-673-8
Authors
Xiaojuan Ren  Purdue University
Rudolf Eigenmann  Purdue University
Saurabh Bagchi  Purdue University
Sponsors
ACM: Association for Computing Machinery
SIGARCH: ACM Special Interest Group on Computer Architecture
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1272366.1272372

ABSTRACT

Fine-Grained Cycle Sharing (FGCS) systems aim to utilize the large amount of idle computational resources available on the Internet. Such systems allow guest jobs to run on a host if they do not significantly impact the host's local users. Since the hosts are typically provided voluntarily, their availability fluctuates greatly. To provide fault tolerance to guest jobs without adding significant computational overhead, we propose failure-aware checkpointing techniques that apply knowledge of resource availability to select checkpoint repositories and to determine checkpoint intervals. We present schemes for selecting reliable and efficient repositories from the non-dedicated hosts that contribute their disk storage. These schemes are formulated as 0/1 programming problems to optimize the network overhead of transferring checkpoints and the work lost when a storage host is unavailable at the time it is needed to recover a guest job. We determine the checkpoint interval by comparing the cost of checkpointing immediately with the cost of delaying it to a later time, which is a function of the resource availability. We evaluate these techniques on an FGCS system called iShare, using trace-based simulation. The results show that they achieve better application performance than the prevalent methods, which use checkpointing with a fixed periodicity on dedicated checkpoint servers.
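
As an illustration only, the repository-selection idea in the abstract can be sketched as a tiny 0/1 selection problem. The abstract does not give the actual formulation, so everything below is an assumption: the `StorageHost` fields, the cost model (total transfer cost plus expected lost work), the independence of host failures, and the brute-force search are hypothetical stand-ins, not the authors' method.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class StorageHost:
    """Hypothetical description of a candidate checkpoint repository."""
    name: str
    transfer_cost: float   # assumed: time to ship one checkpoint to this host
    p_unavailable: float   # assumed: predicted probability the host is down at recovery


def select_repositories(hosts, replicas, work_at_risk):
    """Pick exactly `replicas` hosts (a 0/1 choice per host) minimizing
    transfer cost plus expected lost work.

    Assumed cost model: the work is lost only if every chosen host is
    unavailable, and host failures are treated as independent.
    """
    best_cost, best_subset = float("inf"), ()
    for subset in combinations(hosts, replicas):
        transfer = sum(h.transfer_cost for h in subset)
        p_all_down = 1.0
        for h in subset:
            p_all_down *= h.p_unavailable
        cost = transfer + work_at_risk * p_all_down
        if cost < best_cost:
            best_cost, best_subset = cost, subset
    return best_subset, best_cost


if __name__ == "__main__":
    candidates = [
        StorageHost("A", transfer_cost=2.0, p_unavailable=0.5),
        StorageHost("B", transfer_cost=5.0, p_unavailable=0.1),
        StorageHost("C", transfer_cost=3.0, p_unavailable=0.2),
    ]
    chosen, cost = select_repositories(candidates, replicas=2, work_at_risk=100.0)
    print([h.name for h in chosen], cost)
```

With these made-up numbers, the cheapest host to reach (A) is not selected, because its high predicted unavailability makes the expected lost work dominate. Exhaustive enumeration is only practical for a handful of candidates; a real system would hand the same objective to an integer-programming solver.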



