Coded Computing Systems Decoded: Dealing with Unreliability and Elasticity in Modern Computing

2019-06-06T19:10:40Z (GMT) by Yaoqing Yang
Robustness is a fundamental and timeless issue, and it remains vital to all aspects of computation systems, regardless of specific computation platforms, architectures, and algorithm design. The issue is also timely: modern computing systems are increasingly built on unreliable substrates. This thesis designs reliable computing techniques for distributed systems, circuits and networks.<br>We primarily study techniques inspired from coding theory to address the robustness issues such as system elasticity, stragglers (slow workers), machine failures and soft errors, by carefully weaving redundancy into the data and<br>the design of the algorithm. We primarily focus on three aspects of coding-based computation techniques.<br>The first aspect is to design adaptive computing techniques in largescale systems that are elastic, i.e., the number of machines can change with time. The second is to make the coding-based computation schemes cater to the specific needs of iterative and multi-stage algorithms, such as power iterations and gradient descent that are essential for today’s data analytics. The third aspect is to study the fundamental limits of information propagation in computation networks, provide information-theoretic outer bounds on error<br>accumulation, and design in-network computing schemes to compensate for this error accumulation, e.g., in a circuit network where each computation<br>component can be unreliable. This thesis presents theoretical results that are fundamental to the understanding<br>of computation systems, and opportunities for radical improvements, e.g. in computation load, communication overhead, and storage cost. We also present real-system implementations and real-data experiments and show our progress on taking these theoretical results closer to practice. The academic results presented in this thesis advance on classical results in the historical field of reliable distributed computing, while simultaneously addressing timely issues arising in today’s computing systems.<br>