Credence: Algorithm Based Fault Tolerance at Datacenter Scale

2019-01-18T21:45:34Z (GMT) by Utsav Sheth
As an increasing number of modern big data systems utilize horizontal scaling,<br>the general trend in the distributed systems world has been to use general purpose<br>com- modity hardware to reduce capital expenditure. System failures resulting from<br>the use of inferior hardware have therefore become common at scale. Further, congested<br>datacenter networks can result in high communication latencies and packet<br>drops at network switches. Coded computing is a novel computing technique based<br>on error correcting codes that aims to achieve algorithm based fault tolerance in a<br>distributed system that is composed of unreliable compute nodes and networks. In<br>this thesis, we explore the application of coded computing techniques to the problem<br>of distributed matrix multiplication. Matrix multiplication is foundational to a<br>number of applications today ranging from machine learning to scienti c computing.<br>We discuss some applications of coded matrix multiplication and then discuss<br>the design and implementation of a Mesos framework that utilizes coded computing<br>for distributed matrix multiplication and the methodology used to evaluate it. Finally,<br>we discuss a novel scheduling strategy to minimize the latency of coded matrix<br>multiplication jobs.<br><br>