Fully-Decentralized Coded Computing for Reliable Large-Scale Computing

Jeong, Haewon

doi:10.1184/R1/12000657.v1

Jeong_cmu_0041E_10516.pdf (2.96 MB)

thesis

posted on 2020-03-30, 19:35 authored by Haewon Jeong

In this thesis, I ask the question “how do we compute reliably using thousands of distributed, unreliable nodes?” We propose a system-level solution in which we add redundant data across distributed nodes using a technique called “coded computing.” Our main contribution is developing strategies for a masterless, fully-decentralized setting for important computation primitives in machine learning (ML) and scientific computing applications while minimizing the overhead of coding. For distributed matrix multiplication, we make a fundamental advance by proposing coded computing strategies that outperform prior works by an unbounded factor, including recently-developed coded computing strategies as well as traditional Algorithm-Based Fault Tolerance (ABFT) strategies. We also propose coded computing schemes for other primitives such as the fast Fourier transform (FFT) and matrix QR factorization. Completing computation reliably and on time under diverse unpredictabilities (e.g., stragglers, node failures, bit flips) is becoming an increasingly important problem. The amount of data we collect is growing exponentially, and recent developments in ML have enabled utilizing and processing such large quantities of data. This has led not only to an increase in the scale of computing but also to the wide popularity of large-scale computing across our society. I will discuss how masterless coded computing can be a more efficient fault-tolerance technique under growing unpredictability in computing systems, providing both theoretical and experimental evidence.

History

Date

2020-03-18

Degree Type

Dissertation

Department

Electrical and Computer Engineering

Degree Name

Doctor of Philosophy (PhD)

Advisor(s)

Pulkit Grover

Keywords

Coded computing, Distributed computing, Error-correcting codes, Fault-tolerant algorithms, High-performance computing, Parallel matrix multiplication

Licence

In Copyright
