Carnegie Mellon University

SiaHet: Towards Exploiting Intra-Job Resource Heterogeneity in Heterogeneity-aware, Goodput Optimized Deep Learning Cluster Scheduling

thesis
posted on 2024-05-31, 19:31 authored by Nishant Ravi Shankar

The Sia scheduler represents an advancement in efficiently allocating cluster resources to Deep Learning Training jobs in a heterogeneous GPU cluster, resulting in improved Job Completion Times (JCTs) and cluster utilization. However, its current implementation accounts for GPU heterogeneity only when making allocation decisions, and ultimately assigns only homogeneous GPU resources to any single Deep Learning Training job. This limitation motivates exploring Intra-Job Heterogeneity in scheduling decisions, which can unlock additional cluster utilization and parallelism and thereby further improve average JCTs. This thesis aims to address the Sia scheduler's limitation by developing SiaHet, an enhancement of the Sia scheduler with two key features that unlock Intra-Job Resource Heterogeneity: (a) an enhanced heterogeneous resource allocation policy for Deep Learning Training jobs, and (b) an execution engine runtime capable of performing hybrid parallel Deep Learning execution over heterogeneous resources. The thesis also aims to study the effects of Intra-Job Resource Heterogeneity through workload experiments in a small-scale research cluster, demonstrating its benefits for mixed-priority workloads at specific cluster sizes.

History

Date

2024-05-03

Degree Type

  • Dissertation

Department

  • Information Networking Institute

Degree Name

  • Master of Science (MS)

Advisor(s)

Gregory Ganger
