Adaptive Scheduling With Parallelism Feedback
- Kunal Agrawal
- Yuxiong He
- Wen-Jing Hsu
- Charles E. Leiserson
PPoPP
Multiprocessor scheduling in a shared multiprogramming environment is often structured as two-level scheduling, where a kernel-level job scheduler allots processors to jobs and a user-level task scheduler schedules the work of a job on the allotted processors. In this context, the number of processors allotted to a particular job may vary during the job’s execution, and the task scheduler must adapt to these changes in processor resources. For overall system efficiency, the task scheduler should also provide parallelism feedback to the job scheduler to avoid the situation where a job is allotted processors that it cannot use productively. We present an adaptive task scheduler for multithreaded jobs with dependencies that provides continual parallelism feedback to the job scheduler in the form of requests for processors. Our scheduler guarantees that a job completes near-optimally while utilizing at least a constant fraction of the allotted processor cycles. Our scheduler can be applied to schedule data-parallel programs, such as those written in High Performance Fortran (HPF), *Lisp, C*, NESL, and ZPL. Our analysis models the job scheduler as the task scheduler’s adversary, challenging the task scheduler to be robust to the system environment and the job scheduler’s administrative policies. For example, the job scheduler can make available a huge number of processors exactly when the job has little use for them. To analyze the performance of our adaptive task scheduler under this stringent adversarial assumption, we introduce a new technique called “trim analysis,” which allows us to prove that our task scheduler performs poorly on at most a small number of time steps, exhibiting near-optimal behavior on the vast majority. To be precise, suppose that a job has work T1 and critical-path length T∞ and is running on a machine with P processors.
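The two-level interaction described above can be sketched as a simple per-quantum allotment step. This is an illustrative toy, not the paper's algorithm: jobs are represented only by their current processor desires, and the function `allot` (a name introduced here) shows how a job scheduler can honor parallelism feedback by never granting a job more processors than it requests.

```python
def allot(desires, availability):
    """One quantum of a toy job scheduler: grant each job at most its
    requested number of processors (its parallelism feedback), without
    exceeding the total processor availability for this quantum.

    desires: list of per-job processor requests for this quantum
    availability: total processors available this quantum
    Returns the per-job allotments, in the same order as `desires`.
    """
    allotments = []
    remaining = availability
    for desire in desires:
        granted = min(desire, remaining)  # never exceed the request
        allotments.append(granted)
        remaining -= granted
    return allotments

# Three jobs request 4, 3, and 5 processors; only 8 are available.
print(allot([4, 3, 5], 8))  # -> [4, 3, 1]
```

Because each job receives at most its requested allotment, processors are not wasted on jobs that cannot use them productively, which is the point of the parallelism feedback.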
Using trim analysis, we prove that our scheduler completes the job in O(T1/P̃ + T∞ + L lg P) time steps, where L is the length of a scheduling quantum and P̃ denotes the O(T∞ + L lg P)-trimmed availability. This quantity is the average of the processor availability over all time steps excluding the O(T∞ + L lg P) time steps with the highest processor availability. When T1/T∞ ≫ P̃ (the job’s parallelism dominates the O(T∞ + L lg P)-trimmed availability), the job achieves nearly perfect linear speedup. Conversely, when T1/T∞ ≪ P̃, the asymptotic running time of the job is nearly the length of its critical path.
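The trimmed availability defined above is a plain order statistic, so it is easy to make concrete. The sketch below (function name `trimmed_availability` is ours, not the paper's) averages the per-step processor availability after discarding the r steps with the highest availability, which is exactly how an adversarial burst of useless processors is excluded from the average.

```python
def trimmed_availability(availability, r):
    """R-trimmed availability: the mean processor availability over all
    time steps, ignoring the r time steps with the highest availability.
    `availability` is a sequence of per-time-step processor counts."""
    if not 0 <= r < len(availability):
        raise ValueError("r must satisfy 0 <= r < number of time steps")
    # Sort ascending and drop the r largest values before averaging.
    kept = sorted(availability)[:len(availability) - r]
    return sum(kept) / len(kept)

# The adversary offers 100 processors on one step and 2 on the rest;
# trimming that one step leaves an average availability of 2.
print(trimmed_availability([2, 2, 100, 2], 1))  # -> 2.0
```

In the analysis, r = O(T∞ + L lg P), so a job scheduler that dumps a huge allotment on the job at a few inopportune steps cannot inflate P̃, and the O(T1/P̃ + T∞ + L lg P) bound remains meaningful.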