## Fully-Decentralized Coded Computing for Reliable Large-Scale Computing

thesis

posted on 30.03.2020 by Haewon Jeong#### thesis

In order to distinguish essays and pre-prints from academic theses, we have a separate category. These are often much longer text based documents than a paper.

In this thesis, I ask the question “how do we compute reliably using thousands of distributed, unreliable nodes?” We propose a system-level solution where we add redundant data across distributed nodes using the technique called “coded computing.” Our main contribution is developing strategies for a masterless, fullydecentralized

setting for important computation primitives in machine learning (ML) and scientific computing applications while minimizing the overhead of coding. For distributed matrix multiplication, we make a fundamental advance by proposing

coded computing strategies that outperform prior works by an unbounded factor, including recently-developed coded computing strategies as well as traditional Algorithm-Based Fault Tolerance (ABFT) strategies. We also propose coded computing schemes for other primitives such as fast Fourier transform (FFT) and matrix QR factorization. Completing computation reliably and in time under diverse unpredictabilities (e.g., stragglers, node failures, bit flips) is becoming a more important problem. The amount of data we collect is growing exponentially and recent developments in ML have enabled utilizing and processing such large quantities of data. This has not only led to an increase in the scale of computing but also the wide popularity of largescale

computing across our society. I will discuss how masterless coded computing can be a more efficient fault-tolerance technique under growing unpredictability in computing systems, providing both theoretical and experimental evidence.

setting for important computation primitives in machine learning (ML) and scientific computing applications while minimizing the overhead of coding. For distributed matrix multiplication, we make a fundamental advance by proposing

coded computing strategies that outperform prior works by an unbounded factor, including recently-developed coded computing strategies as well as traditional Algorithm-Based Fault Tolerance (ABFT) strategies. We also propose coded computing schemes for other primitives such as fast Fourier transform (FFT) and matrix QR factorization. Completing computation reliably and in time under diverse unpredictabilities (e.g., stragglers, node failures, bit flips) is becoming a more important problem. The amount of data we collect is growing exponentially and recent developments in ML have enabled utilizing and processing such large quantities of data. This has not only led to an increase in the scale of computing but also the wide popularity of largescale

computing across our society. I will discuss how masterless coded computing can be a more efficient fault-tolerance technique under growing unpredictability in computing systems, providing both theoretical and experimental evidence.

### History

#### Date

18/03/2020#### Degree Type

Dissertation#### Department

Electrical and Computer Engineering#### Degree Name

- Doctor of Philosophy (PhD)