Jeong_cmu_0041E_10516.pdf (2.96 MB)

Fully-Decentralized Coded Computing for Reliable Large-Scale Computing

Download (2.96 MB)
thesis
posted on 30.03.2020 by Haewon Jeong
In this thesis, I ask the question “how do we compute reliably using thousands of distributed, unreliable nodes?” We propose a system-level solution where we add redundant data across distributed nodes using the technique called “coded computing.” Our main contribution is developing strategies for a masterless, fullydecentralized
setting for important computation primitives in machine learning (ML) and scientific computing applications while minimizing the overhead of coding. For distributed matrix multiplication, we make a fundamental advance by proposing
coded computing strategies that outperform prior works by an unbounded factor, including recently-developed coded computing strategies as well as traditional Algorithm-Based Fault Tolerance (ABFT) strategies. We also propose coded computing schemes for other primitives such as fast Fourier transform (FFT) and matrix QR factorization. Completing computation reliably and in time under diverse unpredictabilities (e.g., stragglers, node failures, bit flips) is becoming a more important problem. The amount of data we collect is growing exponentially and recent developments in ML have enabled utilizing and processing such large quantities of data. This has not only led to an increase in the scale of computing but also the wide popularity of largescale
computing across our society. I will discuss how masterless coded computing can be a more efficient fault-tolerance technique under growing unpredictability in computing systems, providing both theoretical and experimental evidence.

History

Date

18/03/2020

Degree Type

Dissertation

Department

Electrical and Computer Engineering

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Pulkit Grover

Exports