Predicting bounds on queuing delay for batch-scheduled parallel machines
Source: Principles and Practice of Parallel Programming
Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming
New York, New York, USA
SESSION: Shared memory parallelism
Pages: 110 - 118  
Year of Publication: 2006
ISBN:1-59593-189-9
Authors
John Brevik  University of California, Santa Barbara, CA
Daniel Nurmi  University of California, Santa Barbara, CA
Rich Wolski  University of California, Santa Barbara, CA
Sponsors
ACM: Association for Computing Machinery
SIGPLAN: ACM Special Interest Group on Programming Languages
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1122971.1122989

ABSTRACT

Most space-sharing parallel computers presently operated by high-performance computing centers use batch-queuing systems to manage processor allocation. In many cases, users wishing to use these batch-queued resources have accounts at multiple sites and have the option of choosing at which site or sites to submit a parallel job. In such a situation, the amount of time a user's job will wait in any one batch queue can significantly impact the overall time a user waits from job submission to job completion. In this work, we explore a new method for providing end-users with predictions for the bounds on the queuing delay individual jobs will experience. We evaluate this method using batch scheduler logs for distributed-memory parallel machines that cover a 9-year period at 7 large HPC centers. Our results show that it is possible to predict delay bounds reliably for jobs in different queues, and for jobs requesting different ranges of processor counts. Using this information, scientific application developers can intelligently decide where to submit their parallel codes in order to minimize overall turnaround time.

