Carnegie Mellon University

Towards Robust and Resilient Machine Learning

Thesis, posted on 2022-04-21, authored by Adarsh Prasad

Some common assumptions when building machine learning pipelines are: (1) the training data is sufficiently “clean” and well-behaved, so that there are few or no outliers and the distribution of the data does not have very long tails; (2) the test data follows the same distribution as the training data; and (3) the data is generated from, or is close to, a known model class, such as a linear model or a neural network. However, with easier access to computing, the internet, and various sensor-based technologies, modern data sets that arise in various branches of science and engineering are no longer carefully curated and are often collected in a decentralized, distributed fashion. Consequently, they are plagued with the complexities of heterogeneity, adversarial manipulations, and outliers. As we enter this age of dirty data, the aforementioned assumptions of machine learning pipelines are increasingly indefensible. For the widespread adoption of machine learning, we believe it is imperative that any model have the following three basic elements:

• Robustness: The model can be trained even with noisy and corrupted data.

• Reliability: After training, when deployed in the real world, the model should not break down under benign shifts of the distribution.

• Resilience: The modeling procedure should work under model mis-specification, i.e., even when the modeling assumptions break down, the procedure should find the best possible solution.

In this thesis, our goal is to modify state-of-the-art ML techniques and design new algorithms so that they remain robust, reliable, and resilient even when the aforementioned assumptions fail. Our contributions are as follows. In Chapter 2, we provide a new class of statistically optimal estimators that are provably robust in a variety of settings, including arbitrary contamination and heavy-tailed data. In Chapter 3, we complement our statistically optimal estimators with a new class of computationally efficient estimators for robust risk minimization. These results provide some of the first computationally tractable and provably robust estimators for general statistical models such as linear regression and logistic regression. In Chapter 4, we study the problem of learning Ising models in a setting where some of the samples from the underlying distribution can be arbitrarily corrupted. Finally, in Chapter 5, we discuss the implications of our results for modern machine learning.
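
To make the flavor of the Chapter 3 results concrete, here is a minimal, illustrative sketch (not the thesis's exact algorithm) of robust risk minimization via robust gradient estimation: run gradient descent for least squares, but replace the empirical mean of the per-sample gradients with a coordinate-wise median-of-means aggregate, a classical robust mean estimator. All function names and parameter choices below are ours, chosen for illustration only.

    import numpy as np

    def median_of_means(vectors, num_blocks):
        # Split the rows into blocks, average within each block, then take
        # a coordinate-wise median across the block means. A few corrupted
        # rows can spoil only a few blocks, not the median of block means.
        blocks = np.array_split(vectors, num_blocks)
        block_means = np.array([block.mean(axis=0) for block in blocks])
        return np.median(block_means, axis=0)

    def robust_least_squares(X, y, num_blocks=10, lr=0.1, steps=200):
        # Gradient descent for least squares, with the average gradient
        # replaced by a median-of-means estimate so that heavy-tailed or
        # corrupted samples cannot drag the update arbitrarily far.
        n, d = X.shape
        theta = np.zeros(d)
        for _ in range(steps):
            residuals = X @ theta - y                  # shape (n,)
            per_sample_grads = residuals[:, None] * X  # shape (n, d)
            theta -= lr * median_of_means(per_sample_grads, num_blocks)
        return theta

    # Usage: heavy-tailed noise plus a handful of gross outliers.
    rng = np.random.default_rng(0)
    n, d = 1000, 5
    X = rng.normal(size=(n, d))
    theta_star = rng.normal(size=d)
    y = X @ theta_star + rng.standard_t(df=2, size=n)  # heavy-tailed noise
    y[:20] += 100.0                                    # gross corruptions
    print(np.linalg.norm(robust_least_squares(X, y) - theta_star))

The block-median step is the only change relative to ordinary gradient descent; robust estimators of the kind developed in Chapter 2 would play the role of median_of_means here, with stronger statistical guarantees.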

History

Date

2021-05-11

Degree Type

  • Dissertation

Department

  • Machine Learning

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Pradeep Ravikumar, Sivaraman Balakrishnan
