SiaHet: Towards Exploiting Intra-Job Resource Heterogeneity in Heterogeneity-aware, Goodput Optimized Deep Learning Cluster Scheduling
The Sia scheduler represents an advancement in efficiently allocating cluster resources to Deep Learning Training jobs in a heterogeneous GPU cluster, resulting in improved Job Completion Times (JCTs) and cluster utilization. However, its current implementation accounts for GPU heterogeneity only in its allocation decisions and ultimately assigns only homogeneous GPU resources to any given Deep Learning Training job. This limitation motivates exploring Intra-Job Heterogeneity during scheduling decisions, which can unlock additional cluster utilization and parallelism, thereby further optimizing average JCTs. This thesis aims to address the limitation by developing SiaHet, an enhancement of the Sia scheduler that proposes two key features to unlock Intra-Job Resource Heterogeneity: (a) an enhanced heterogeneous resource allocation policy for Deep Learning Training jobs, and (b) an execution engine runtime capable of performing hybrid parallel Deep Learning execution over heterogeneous resources. The thesis also aims to study the effects of Intra-Job Resource Heterogeneity through workload experiments in a small-scale research cluster, demonstrating its benefits for mixed-priority workloads at specific cluster sizes.
History
Date
2024-05-03
Degree Type
- Dissertation
Department
- Information Networking Institute
Degree Name
- Master of Science (MS)