
Improving ML Applications in Shared Computing Environments

Thesis posted on 2019-05-24, 18:21, authored by Aaron Harlap
Machine learning (ML) has become a powerful building block for modern services, scientific endeavors, and enterprise processes. The expensive computations required to train ML models often make it desirable to run them in a distributed manner in shared computing environments (e.g., Amazon EC2, Microsoft Azure, in-house shared clusters). Shared computing environments introduce a number of challenges, including uncorrelated performance jitter, heterogeneous resources, transient resources, and limited bandwidth. This dissertation demonstrates that, by structuring software frameworks and work distribution to exploit transient resources and to address performance jitter and communication bandwidth limitations, we can improve the efficiency of training machine learning models.
We support this assertion with three case-study systems: FlexRR, Proteus, and PipeDream. FlexRR is a distributed machine learning training system that combines a flexible synchronization model with dynamic peer-to-peer reassignment of work among workers to address stragglers caused by performance jitter. FlexRR achieves near-ideal runtime, mitigating the adverse effects of the stragglers observed in shared computing environments.
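The following sketch illustrates, in miniature, how those two mechanisms could interact: flexible synchronization bounds how far any worker may run ahead of the slowest one, and peer-to-peer reassignment shifts remaining work from a straggler to a faster peer mid-iteration. The slack bound, straggler threshold, and jitter model here are illustrative assumptions, not FlexRR's actual design parameters.

```python
import random

# Minimal simulation sketch (assumed values; not FlexRR's implementation).
SLACK = 2            # flexible sync: max iterations ahead of the slowest worker
STRAGGLER_GAP = 4    # remaining-work gap that triggers P2P reassignment
OFFLOAD = 0.25       # fraction of remaining work handed to a helper

def rebalance(remaining):
    """Peer-to-peer reassignment: the most-loaded worker offloads a chunk
    of its remaining work items to the least-loaded one."""
    slow = max(range(len(remaining)), key=remaining.__getitem__)
    fast = min(range(len(remaining)), key=remaining.__getitem__)
    if remaining[slow] - remaining[fast] > STRAGGLER_GAP:
        moved = int(remaining[slow] * OFFLOAD)
        remaining[slow] -= moved
        remaining[fast] += moved

def tick(iterations, remaining, items_per_iteration=10):
    """One scheduling step across all workers."""
    for w in range(len(remaining)):
        # Flexible synchronization: a worker more than SLACK iterations
        # ahead of the slowest peer waits instead of computing.
        if iterations[w] - min(iterations) > SLACK:
            continue
        # Simulated performance jitter: finish 1-3 work items this tick.
        remaining[w] = max(0, remaining[w] - random.randint(1, 3))
        if remaining[w] == 0:                 # iteration complete
            iterations[w] += 1
            remaining[w] = items_per_iteration
    rebalance(remaining)

iterations, remaining = [0] * 4, [10] * 4
for _ in range(50):
    tick(iterations, remaining)
print(iterations)   # workers finish within a bounded distance of one another
```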
Proteus is an agile elastic machine learning training system that uses tiers of reliability and intelligent resource management to efficiently utilize transient compute resources. Evaluations on AWS EC2 show that Proteus reduces cost by 85% relative to non-transient pricing, and by 43% relative to previous approaches, while simultaneously reducing runtimes by up to 37%.
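The sketch below illustrates one way the tiered-reliability idea can be realized: durable model state kept on reliable (non-transient) machines, with stateless workers on cheap transient machines such as EC2 spot instances, so that a revocation loses only in-flight computation, never the model. All class and method names are hypothetical, not Proteus's API.

```python
# Tiered-reliability sketch under the assumption above (illustrative only).

class ReliableParameterTier:
    """Model parameters on reliable machines; unaffected by revocations."""
    def __init__(self):
        self.params = {}

    def apply_update(self, key, delta):
        self.params[key] = self.params.get(key, 0.0) + delta

class TransientWorker:
    """Stateless worker on a transient machine; safe to lose at any moment."""
    def __init__(self, wid, tier, lr=0.01):
        self.wid, self.tier, self.lr = wid, tier, lr

    def train_step(self, key, gradient):
        # Push the update to the reliable tier; keep no local model state.
        self.tier.apply_update(key, -self.lr * gradient)

class ElasticCluster:
    """Grows on cheap transient capacity, shrinks on revocation notices."""
    def __init__(self, tier):
        self.tier, self.workers = tier, {}

    def add_transient_worker(self, wid):
        self.workers[wid] = TransientWorker(wid, self.tier)

    def on_revocation(self, wid):
        # Only in-flight computation is lost; the model survives because
        # all durable state sits in the reliable tier.
        self.workers.pop(wid, None)

tier = ReliableParameterTier()
cluster = ElasticCluster(tier)
cluster.add_transient_worker("spot-1")
cluster.workers["spot-1"].train_step("w0", gradient=2.0)
cluster.on_revocation("spot-1")       # spot instance revoked mid-training
print(tier.params)                    # {'w0': -0.02} -- state preserved
```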
PipeDream is a distributed training system for deep neural networks (DNNs) that partitions ranges of DNN layers among machines and aggressively pipelines computation and communication. By reducing the amount of communication and overlapping communication with computation, PipeDream provides a 5x or greater improvement in "time to accuracy" when training large DNN models.
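A toy sketch of the pipelining idea follows: two stages, each standing in for a range of DNN layers on its own machine, are connected by queues, and several minibatches are in flight at once, so one stage computes while the next minibatch's inputs are still arriving. The thread-and-queue plumbing and the stage functions are illustrative assumptions, not PipeDream's implementation.

```python
import queue
import threading

def make_stage(fn, inbox, outbox):
    """Run one pipeline stage: consume inputs, emit outputs until shutdown."""
    def run():
        while True:
            item = inbox.get()
            if item is None:            # shutdown sentinel
                outbox.put(None)
                return
            outbox.put(fn(item))        # compute while upstream keeps sending
    return threading.Thread(target=run)

stage1 = lambda x: x * 2               # stands in for layers 1..k
stage2 = lambda x: x + 1               # stands in for layers k+1..n

q_in, q_mid, q_out = queue.Queue(), queue.Queue(), queue.Queue()
threads = [make_stage(stage1, q_in, q_mid), make_stage(stage2, q_mid, q_out)]
for t in threads:
    t.start()

# Inject several minibatches back-to-back: stage1 works on minibatch i+1
# while stage2 works on minibatch i, overlapping compute with transfer.
for minibatch in range(4):
    q_in.put(minibatch)
q_in.put(None)

results = []
while (r := q_out.get()) is not None:
    results.append(r)
print(results)   # [1, 3, 5, 7]
```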

History

Date

2019-05-10

Degree Type

  • Dissertation

Department

  • Electrical and Computer Engineering

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Greg Ganger, Phil Gibbons
