ABSTRACT
Most space-sharing parallel computers presently operated by high-performance computing centers use batch-queuing systems to manage processor allocation. In many cases, users wishing to use these batch-queued resources have accounts at multiple sites and can choose at which site or sites to submit a parallel job. In such a situation, the amount of time a user's job will wait in any one batch queue can significantly impact the overall time a user waits from job submission to job completion. In this work, we explore a new method for providing end-users with predictions for the bounds on the queuing delay individual jobs will experience. We evaluate this method using batch scheduler logs for distributed-memory parallel machines that cover a 9-year period at 7 large HPC centers. Our results show that it is possible to predict delay bounds reliably for jobs in different queues, and for jobs requesting different ranges of processor counts. Using this information, scientific application developers can intelligently decide where to submit their parallel codes in order to minimize overall turnaround time.