Carnegie Mellon University
Browse
Sheth_cmu_0041O_10328.pdf (1.32 MB)

Credence: Algorithm Based Fault Tolerance at Datacenter Scale

Download (1.32 MB)
thesis
posted on 2018-12-01, 00:00 authored by Utsav ShethUtsav Sheth
As an increasing number of modern big data systems utilize horizontal scaling,
the general trend in the distributed systems world has been to use general purpose
com- modity hardware to reduce capital expenditure. System failures resulting from
the use of inferior hardware have therefore become common at scale. Further, congested
datacenter networks can result in high communication latencies and packet
drops at network switches. Coded computing is a novel computing technique based
on error correcting codes that aims to achieve algorithm based fault tolerance in a
distributed system that is composed of unreliable compute nodes and networks. In
this thesis, we explore the application of coded computing techniques to the problem
of distributed matrix multiplication. Matrix multiplication is foundational to a
number of applications today ranging from machine learning to scienti c computing.
We discuss some applications of coded matrix multiplication and then discuss
the design and implementation of a Mesos framework that utilizes coded computing
for distributed matrix multiplication and the methodology used to evaluate it. Finally,
we discuss a novel scheduling strategy to minimize the latency of coded matrix
multiplication jobs.

History

Date

2018-12-01

Degree Type

  • Master's Thesis

Department

  • Information Networking Institute

Degree Name

  • Master of Science (MS)

Advisor(s)

Pulkit Grover

Usage metrics

    Exports

    RefWorks
    BibTeX
    Ref. manager
    Endnote
    DataCite
    NLM
    DC