Machine Learning Systems for Highly-Distributed and Rapidly-Growing Data

posted on 13.11.2019, 19:22 authored by Kevin Hsieh
The usability and practicality of any machine learning (ML) application are largely influenced by two critical but hard-to-attain factors: low latency and low cost. Unfortunately, achieving low latency and low cost is very challenging when ML depends on real-world data that are highly distributed and rapidly growing (e.g., data collected by mobile phones and video cameras all over the world). Such real-world data pose many challenges in communication and computation. For example, when training data are distributed across data centers that span multiple continents, communication among data centers can easily overwhelm the limited wide-area network bandwidth, leading to prohibitively high latency and high cost. In this dissertation, we demonstrate that the latency and cost of ML on highly distributed and rapidly growing data can be improved by one to two orders of magnitude by designing ML systems that exploit the characteristics of ML algorithms, ML model structures, and ML training/serving data. We support this thesis statement with three contributions. First, we design a system that provides both low-latency and low-cost ML serving (inferencing) over large-scale and continuously growing datasets, such as videos. Second, we build a system that makes ML training over geo-distributed datasets as fast as training within a single data center. Third, we present a first detailed study and a system-level solution on a fundamental and largely overlooked problem: ML training over non-IID (i.e., not independent and identically distributed) data partitions (e.g., facial images collected by cameras vary according to the demographics of each camera’s location).




Degree Type



Electrical and Computer Engineering

Degree Name

  • Doctor of Philosophy (PhD)


Advisor(s)

  • Phil Gibbons
  • Onur Mutlu