Towards Robust and Resilient Machine Learning
Some common assumptions when building machine learning pipelines are: (1) the training data is sufficiently “clean” and well-behaved, with few or no outliers and without very long tails in its distribution; (2) the test data follows the same distribution as the training data; and (3) the data is generated from, or lies close to, a known model class, such as a linear model or a neural network. However, with easier access to computing, the internet, and various sensor-based technologies, modern data sets arising in many branches of science and engineering are no longer carefully curated and are often collected in a decentralized, distributed fashion. Consequently, they are plagued with the complexities of heterogeneity, adversarial manipulations, and outliers. As we enter this age of dirty data, the aforementioned assumptions of machine learning pipelines are increasingly indefensible. For the widespread adoption of machine learning, we believe it is imperative that any model have the following three basic elements:
• Robustness: The model can be trained even with noisy and corrupted data.
• Reliability: After training and when deployed in the real-world, the model should not break down under benign shifts of the distribution.
• Resilience: The modeling procedure should work under model mis-specification, i.e., even when the modeling assumptions break down, the procedure should find the best possible solution.
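To make the robustness element concrete, the following is a minimal sketch of median-of-means, a classical robust estimator of this flavor (it is an illustration, not the estimator constructed in this thesis; the function name and block count are ours). Splitting the data into blocks and taking the median of the block means lets a few grossly corrupted samples spoil only a few blocks, which the median then ignores:

```python
import random
import statistics

def median_of_means(samples, num_blocks=20):
    """Estimate the mean robustly: average within blocks, then take
    the median of the block means. At most k corrupted samples can
    contaminate at most k blocks, so the median stays clean as long
    as fewer than half the blocks are hit."""
    random.shuffle(samples)
    block_size = max(1, len(samples) // num_blocks)
    block_means = [
        statistics.mean(samples[i:i + block_size])
        for i in range(0, len(samples), block_size)
    ]
    return statistics.median(block_means)

# Clean data centered at 5.0, plus a handful of gross outliers.
random.seed(0)
data = [random.gauss(5.0, 1.0) for _ in range(1000)] + [1e6] * 10
print(statistics.mean(data))   # dragged far from 5 by the outliers
print(median_of_means(data))   # stays close to 5
```

The ordinary sample mean is destroyed by even one unbounded outlier, while the median-of-means estimate remains near the true center; this contrast is exactly the gap the robust estimators of Chapter 2 are designed to close with statistical optimality.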
In this thesis, our goal is to modify state-of-the-art ML techniques and design new algorithms so that they work even without the aforementioned assumptions, and are robust, reliable, and resilient. Our contributions are as follows. In Chapter 2, we provide a new class of statistically optimal estimators that are provably robust across a variety of robustness settings, including arbitrary contamination and heavy-tailed data. In Chapter 3, we complement our statistically optimal estimators with a new class of computationally efficient estimators for robust risk minimization. These results provide some of the first computationally tractable and provably robust estimators for general statistical models such as linear regression and logistic regression. In Chapter 4, we study the problem of learning Ising models in a setting where some of the samples from the underlying distribution can be arbitrarily corrupted. Finally, in Chapter 5, we discuss the implications of our results for modern machine learning.
- Doctor of Philosophy (PhD)