Probabilistic Methods for Mitigating Uncertainties in Large-Scale Computing and Machine Learning
Large-scale computing and machine learning (ML) have been spurred in recent years by the availability of nearly unlimited amounts of computing infrastructure and data. For a long time, the focus of algorithms in these areas has been on achieving the intended goal with the highest fidelity, and on scaling to the available infrastructure and data with the greatest efficiency. Underlying this has been an inherent faith in the reliability of the infrastructure and the data, a faith that is increasingly unjustifiable. Specifically, with the end of Moore’s Law, computing resources cannot scale to the demands of modern workloads while providing the required level of fault tolerance. Likewise, with the adoption of ML for predictive modeling in new and challenging domains such as the natural sciences and infrastructure management, the quantity and quality of available data are often inadequate for ML models to be sufficiently effective. Such unreliability leads to growing uncertainty in the performance of computing and ML algorithms, and this uncertainty needs to be explicitly addressed.
In this thesis, we address uncertainty through the lens of probabilistic methods. The common theme is replacing rule-based (hard) models with probabilistic (soft) models when designing algorithms under uncertainty. We apply this philosophy to three problems corresponding to three types of unreliability.
In Part I, we look into the problem of stragglers: unreliable or slow nodes that cause unexpected delays in large-scale distributed computing. We mitigate these delays by using rateless erasure codes to generate coded distributed computations. The probabilistic design of rateless codes ensures that, with high probability, the overall computation can be recovered from any sufficiently large subset of the coded computations, leading to near-ideal performance in the presence of stragglers.
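To make the recovery property concrete, below is a minimal sketch of coded distributed matrix-vector multiplication using a dense random linear code as a stand-in for a rateless code; the matrix sizes, the encoding matrix G, and the number of responding workers are illustrative assumptions, not the thesis' exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: distribute the matrix-vector product A @ x across worker tasks.
m, n = 20, 8
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)

# Rateless-style encoding (illustrative random linear code): each coded row is
# a random combination of the rows of A, and more coded rows can be generated
# on demand, which is the key property of rateless codes.
num_coded = 30                              # more coded tasks than original rows
G = rng.standard_normal((num_coded, m))     # encoding coefficients
coded_rows = G @ A                          # one coded row per worker task

# Each worker returns its coded inner product; stragglers simply never respond.
results = coded_rows @ x
finished = rng.permutation(num_coded)[:24]  # any sufficiently large subset suffices w.h.p.

# Decode: recover A @ x by solving G[finished] @ (A @ x) = results[finished].
Ax_decoded, *_ = np.linalg.lstsq(G[finished], results[finished], rcond=None)
print(np.allclose(Ax_decoded, A @ x))       # True (up to numerical tolerance)
```

Because the encoding coefficients are random, any subset of responses slightly larger than the original workload is full rank with high probability, so the master never has to wait for specific (possibly straggling) workers.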
In Part II, we investigate the challenge of designing effective ML algorithms in settings where the amount of data is significantly smaller than that required by popular deep learning algorithms, which consequently overfit. We propose novel probabilistic ML algorithms that compute similarities between latent probabilistic representations of training and test data to make meaningful predictions. Because probabilistic representations are robust to overfitting, these algorithms work well even in small-data settings.
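As a rough illustration of prediction via similarities between probabilistic representations, here is a minimal sketch that assumes diagonal-Gaussian latent representations and a Bhattacharyya-kernel similarity with a similarity-weighted vote; these specific choices are assumptions for illustration and need not match the thesis' models.

```python
import numpy as np

def bhattacharyya_similarity(mu1, var1, mu2, var2):
    """Similarity between two diagonal-Gaussian representations (Bhattacharyya kernel)."""
    var = 0.5 * (var1 + var2)
    dist = 0.125 * np.sum((mu1 - mu2) ** 2 / var) \
         + 0.5 * np.sum(np.log(var / np.sqrt(var1 * var2)))
    return np.exp(-dist)

def predict(test_mu, test_var, train_mus, train_vars, train_labels):
    """Similarity-weighted prediction from probabilistic latent representations."""
    sims = np.array([bhattacharyya_similarity(test_mu, test_var, mu, var)
                     for mu, var in zip(train_mus, train_vars)])
    weights = sims / sims.sum()
    return weights @ train_labels   # soft vote over training labels

# Toy usage with 2-D latent Gaussians (means and diagonal variances).
rng = np.random.default_rng(1)
train_mus = rng.standard_normal((5, 2))
train_vars = np.full((5, 2), 0.5)
train_labels = np.array([0., 0., 1., 1., 1.])
print(predict(np.zeros(2), np.full(2, 0.5), train_mus, train_vars, train_labels))
```

The point of the sketch is that each data point is represented by a distribution rather than a single vector, so the similarity accounts for uncertainty in the representation itself, which is what helps in small-data regimes.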
Finally, in Part III, we address the problem of drift (change) in the data distribution that causes accuracy drops in ML models, in the context of applying ML to decision making in cloud-computing systems. The transiency of the cloud leads to data drift, and users' strict scalability and latency requirements render conventional solutions impractical or ineffective. Our solution again takes a probabilistic approach: we train several models on subsets of old data (offline), and rank and select the best model (online). The ranking is designed to be computed efficiently, in line with the scalability and latency requirements, while accurately reflecting each model's probability of success on the new (test) data.
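The following is a minimal sketch of the offline-train / online-rank-and-select pattern described above; the use of scikit-learn logistic regression, time-window subsets, and mean predictive log-likelihood as the ranking score are illustrative assumptions, not the thesis' specific scheme.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# Offline: train several candidate models on different subsets
# (e.g., time windows) of historical data.
X_old = rng.standard_normal((3000, 4))
y_old = (X_old[:, 0] + 0.1 * rng.standard_normal(3000) > 0).astype(int)
windows = np.array_split(np.arange(3000), 3)
models = [LogisticRegression().fit(X_old[w], y_old[w]) for w in windows]

# Online: rank models by mean predictive log-likelihood on a small window of
# recent data (a cheap proxy for the probability of success after drift),
# then route new queries to the top-ranked model.
X_recent = rng.standard_normal((100, 4)) + np.array([0.5, 0.0, 0.0, 0.0])  # drifted
y_recent = (X_recent[:, 0] > 0.5).astype(int)

def score(model, X, y):
    proba = model.predict_proba(X)[np.arange(len(y)), y]
    return np.log(np.clip(proba, 1e-12, None)).mean()

best = max(models, key=lambda m: score(m, X_recent, y_recent))
print("selected model index:", models.index(best))
```

Ranking on a small recent window keeps the online step cheap, while the probabilistic score lets the system prefer whichever offline model best explains the post-drift data.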
History
Date
- 2022-07-30
Degree Type
- Dissertation
Department
- Electrical and Computer Engineering
Degree Name
- Doctor of Philosophy (PhD)