Using checkpointing to recover from poor multi-site parallel job scheduling decisions
Source Middleware Conference archive
Proceedings of the 5th international workshop on Middleware for grid computing: held at the ACM/IFIP/USENIX 8th International Middleware Conference
Newport Beach, California
Article No. 2  
Year of Publication: 2007
ISBN: 978-1-59593-944-9
Author
William M. Jones  United States Naval Academy, Annapolis, MD
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1376849.1376851

ABSTRACT

Recent research in multi-site parallel job scheduling uses user-provided estimates of job communication characteristics to partition a job effectively across multiple clusters. Previous research examined how inaccuracies in these estimates affect overall system performance and found that multi-site scheduling techniques benefit from the estimates even when they are considerably inaccurate. While these results are encouraging, in many instances estimation errors lead to poor scheduling decisions that over-subscribe the network, which can significantly degrade application runtime performance and turnaround time.

In this paper, we explore the use of job checkpointing to selectively stop offending jobs, alleviating network congestion, and to restart them when (and where) sufficient network resources are available. We then characterize the conditions under which, and the extent to which, checkpointing improves overall performance. We demonstrate that checkpointing is beneficial even when its overhead is high.
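
The stop-and-restart idea described above can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract does not say how offending jobs are chosen or when they are restarted, so the `Job` class, the single shared link with a fixed `capacity`, the "largest consumer first" stop rule, and the "smallest first" restart rule are all assumptions made only for illustration. Checkpoint overhead and restarting on a different cluster are not modeled.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical job record; not taken from the paper."""
    name: str
    bandwidth: float          # inter-cluster bandwidth the job actually uses
    checkpointed: bool = False


def relieve_congestion(jobs, capacity):
    """Checkpoint running jobs until total demand fits the link capacity.

    Assumed policy: stop the largest bandwidth consumer first.
    """
    running = sorted(
        (j for j in jobs if not j.checkpointed),
        key=lambda j: j.bandwidth,
        reverse=True,
    )
    demand = sum(j.bandwidth for j in running)
    stopped = []
    for job in running:
        if demand <= capacity:
            break  # link is no longer over-subscribed
        job.checkpointed = True
        demand -= job.bandwidth
        stopped.append(job)
    return stopped


def restart_when_possible(jobs, capacity):
    """Restart checkpointed jobs that fit in the spare bandwidth.

    Assumed policy: try the smallest checkpointed job first.
    """
    demand = sum(j.bandwidth for j in jobs if not j.checkpointed)
    restarted = []
    waiting = sorted((j for j in jobs if j.checkpointed), key=lambda j: j.bandwidth)
    for job in waiting:
        if demand + job.bandwidth <= capacity:
            job.checkpointed = False
            demand += job.bandwidth
            restarted.append(job)
    return restarted
```

In this sketch, `relieve_congestion` would be called when measured demand exceeds the link capacity, and `restart_when_possible` whenever capacity frees up, for example after another job completes.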

