In recent years, the amount of computation invested in machine learning
(ML) and deep learning (DL) training has grown by several orders of magnitude.
Under these conditions, elasticity—the ability of a system to dynamically adapt to
changing supply and demand of compute resources over time—is a key ingredient
for efficient resource management. Elasticity has long been proven to improve the
resource utilization, execution performance, and fault tolerance of traditional applications such as web services and big data processing. However, elastic ML training
is a relatively new area of interest and faces challenges distinct from those of traditional
applications, owing to ML training’s highly sub-linear resource scalability, diverse execution patterns and strategies, and dependencies between distributed workers.
This thesis goes beyond existing early work on elastic ML by employing
co-adaptation, i.e., combining both system-level and application-side adaptations, to
better adapt to dynamic compute resources. While previous frameworks enable elasticity through system-level mechanisms alone, they overlook the inherent
resource adaptability of ML training itself, which can be leveraged to better overcome the
aforementioned challenges. We present the design, implementation, and evaluation
of three elastic systems for ML that reduce DL training time in shared GPU clusters by 37-50%, enable elasticity for a diverse set of ML training applications, and
reduce the impact of resource failures by 78-95%.