A Principled Approach to Parallel Job Scheduling

Berg, Benjamin

doi:10.1184/R1/21817980.v1

CMU-CS-22-138.pdf (4.49 MB)

A Principled Approach to Parallel Job Scheduling

thesis

posted on 2023-01-06, 21:53 authored by Benjamin BergBenjamin Berg

A wide range of modern computer systems process workloads composed of parallelizable jobs. Data centers, supercomputers, machine learning clusters, distributed computing frameworks, and databases all process jobs which are designed to be parallelized across multiple servers or cores. A job will receive some speedup from being parallelized across additional servers or cores, allowing the job to complete more quickly. However, jobs generally cannot be perfectly parallelized, and receive diminishing returns from being allocated additional cores. Hence, given a fixed number of cores, it is not obvious how to allocate cores across a set of jobs in order to reduce the response times of the jobs — the times from when each job arrives to the system until it is completed. While this question has been considered by the worst-case scheduling community, existing theoretical results are hampered by strong lower bounds on worst-case performance and tend to suggest policies which do not work well in practice. Meanwhile, state-of-the-art systems employ simple heuristic-based policies that leave significant room for improvement. The goal of this thesis is to develop and analyze policies for scheduling parallelizable jobs which perform well both in theory and in practice.

Our approach in developing new scheduling policies for parallelizable jobs is threefold. First, we develop new stochastic models of parallelizable jobs running in a multicore system. Second, we analyze these new models using the tools of stochastic performance modeling, showing that a stochastic style of analysis emits scheduling policies which provably outperform the policies suggested by the worst-case literature. Finally, we validate our theoretical models through simulation to show that the scheduling policies we derive work well in practice. We consider the case where job sizes are completely unknown to the system, the case where job sizes are perfectly known to the system, and the case where some jobs are known to be larger than others on average. Similarly, we consider the case where all jobs are assumed to have the same level of parallelizability, the case where some jobs are more parallelizable than others, and the case where a job’s parallelizability can change over time.

History

Date

2022-08-01

Degree Type

Dissertation

Department

Computer Science

Degree Name

Doctor of Philosophy (PhD)

Advisor(s)

Mor Harchol-Balter

Usage metrics

Keywords

performance modeling scheduling parallelizable jobs speedup functions

Licence

In Copyright

Exports

RefWorks

BibTeX

Ref. manager

Endnote

DataCite

NLM

DC

A Principled Approach to Parallel Job Scheduling

History

Date

Degree Type

Department

Degree Name

Advisor(s)

Usage metrics

Categories

Keywords

Licence

Exports