Carnegie Mellon University

New perspectives on optimization: combating data poisoning, solving Euclidean optimization and learning minimax optimal estimators

thesis
posted on 2025-06-24, 17:58 authored by Kartik Gupta

Robustness properties of optimization techniques are a fundamental requirement for modern ML applications. In an environment where vast quantities of data are scraped from the Internet or collected from varied sources, curating clean data has become an increasingly intractable problem. Sophisticated adversaries can easily bypass simple data cleaning methods and poison training data in order to influence the training procedure. Many attacks can construct a backdoor in the learned model, enabling maliciously targeted predictions during deployment. Some advanced attacks even claim broad generality: corruptions produced for one model can be successfully used to deploy an attack against another. This is a problem that models of all complexities must contend with in the real world. To address this issue, existing research has focused on providing specialized solutions for specific model classes, which puts an enormous burden on the practitioner to keep pace with the rapid progress of research. Moreover, many techniques treat optimization and handling poisoned data as two separate problems.

In the first part of this thesis, we take a unified approach in which we provide a single optimization technique that can be used to train any ML model and that also shows strong resilience against data poisoning attacks. We experimentally evaluate our technique on a wide class of ML models and provide a theoretical analysis of convergence as well as a mathematical understanding of what enables its robustness against data poisoning. For a given optimization problem, our technique constructs a sequence of smaller-dimensional optimization problems. It assumes access to a black box that can solve these smaller-dimensional problems; the solution of the last problem in the sequence is the desired model.
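As a hedged illustration of this black-box template (not the exact algorithm from the thesis: the function names, the random-subspace construction, and the use of scipy's Nelder-Mead routine as the black box are all assumptions made only for this sketch), the following Python snippet solves a full-dimensional problem through a sequence of lower-dimensional subproblems, each delegated to an off-the-shelf solver, and returns the solution of the last subproblem lifted back to the full space.

import numpy as np
from scipy.optimize import minimize

def sequential_subspace_solver(loss, dim, subspace_dim=5, num_rounds=50, seed=0):
    # Illustrative sketch only: solve a dim-dimensional problem through a
    # sequence of subspace_dim-dimensional subproblems, each handed to a
    # black-box solver.
    rng = np.random.default_rng(seed)
    x = np.zeros(dim)                                  # current full-dimensional iterate
    for _ in range(num_rounds):
        P = rng.standard_normal((dim, subspace_dim))   # random subspace basis
        restricted = lambda z: loss(x + P @ z)         # smaller-dimensional objective
        # the black box: any off-the-shelf solver for the reduced problem
        z_star = minimize(restricted, np.zeros(subspace_dim), method="Nelder-Mead").x
        x = x + P @ z_star                             # lift the subproblem solution back
    return x                                           # output of the final subproblem

# toy usage: least squares on synthetic data
rng = np.random.default_rng(1)
A, b = rng.standard_normal((100, 20)), rng.standard_normal(100)
w_hat = sequential_subspace_solver(lambda w: np.sum((A @ w - b) ** 2), dim=20)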

We continue the theme of developing algorithms built around cleverly constructed black-box solvers in the second and third parts of this thesis. In the second part, we pose the problem of solving a Euclidean optimization problem as that of solving an optimization problem on a manifold (specifically, the Grassmannian and the Multinomial manifold). This transforms an optimization problem in a given dimension into a sequence of problems in a smaller dimension. For a class of optimization problems defined through a data matrix, we also develop a technique that reduces the row dimension of the data matrix to construct the sequence of problems to solve. These techniques provide a novel perspective on optimization problems in Euclidean space and have the potential to inspire future optimization procedures with robustness and privacy properties.
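To make the row-dimension reduction concrete, here is a minimal sketch under stated assumptions (it is not the thesis' construction: the Gaussian sketching matrix, the least-squares objective, and the use of scipy's L-BFGS-B routine as the black box are choices made only for this example). Each round compresses the rows of the data matrix and hands the reduced problem to a black-box solver, warm-starting from the previous solution.

import numpy as np
from scipy.optimize import minimize

def sketched_sequence_solver(A, b, sketch_rows=50, num_rounds=10, seed=0):
    # Illustrative sketch: reduce the row dimension of the data matrix and
    # delegate each reduced problem to a black-box solver.
    rng = np.random.default_rng(seed)
    n, d = A.shape
    w = np.zeros(d)
    for _ in range(num_rounds):
        S = rng.standard_normal((sketch_rows, n)) / np.sqrt(sketch_rows)  # row compression
        A_s, b_s = S @ A, S @ b                        # reduced data: sketch_rows x d
        reduced_loss = lambda v: np.sum((A_s @ v - b_s) ** 2)
        w = minimize(reduced_loss, w, method="L-BFGS-B").x  # black-box solve, warm start
    return w

# toy usage: a least-squares problem whose row dimension shrinks from 2000 to 50 per round
rng = np.random.default_rng(2)
A, b = rng.standard_normal((2000, 30)), rng.standard_normal(2000)
w_hat = sketched_sequence_solver(A, b)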

In the third part of the thesis, we study online optimization algorithms with access to black-box solvers for minimization and maximization procedures. These algorithms can be used to find the Nash equilibrium of two-player zero-sum games. We formulate the problem of finding minimax optimal estimators as that of finding a Nash equilibrium of a two-player zero-sum game. This lets us bypass the mathematical complexity of constructing minimax optimal estimators, which usually involves substantial problem-specific analysis and the development of new theoretical tools, and instead learn them directly as neural networks by solving the stated two-player game. Using this technique, we construct minimax optimal estimators for a class of fundamental statistical estimation problems for which optimal estimators were not known prior to this work.
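The following toy sketch illustrates the game-theoretic idea on a finite zero-sum matrix game (an illustration only; the thesis learns estimators as neural networks, whereas this snippet uses fictitious play, with simple best-response oracles standing in for the black-box minimization and maximization routines).

import numpy as np

def fictitious_play(payoff, num_rounds=5000):
    # Approximate a Nash equilibrium of a zero-sum matrix game: the row player
    # minimizes p^T payoff q, the column player maximizes it. Best-response
    # oracles play the role of the black-box min/max solvers.
    n_rows, n_cols = payoff.shape
    row_counts, col_counts = np.ones(n_rows), np.ones(n_cols)  # uniform initial play
    for _ in range(num_rounds):
        p_avg = row_counts / row_counts.sum()
        q_avg = col_counts / col_counts.sum()
        row_counts[np.argmin(payoff @ q_avg)] += 1   # black-box minimization oracle
        col_counts[np.argmax(p_avg @ payoff)] += 1   # black-box maximization oracle
    return row_counts / row_counts.sum(), col_counts / col_counts.sum()

# toy usage: matching pennies, whose equilibrium is uniform play with game value 0
payoff = np.array([[1.0, -1.0], [-1.0, 1.0]])
p, q = fictitious_play(payoff)

In zero-sum games the averaged strategies produced by fictitious play are known to converge to an equilibrium, which is why such simple oracle-based online dynamics can serve as the outer loop around black-box solvers.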

In summary, this thesis develops techniques with broad applicability within their respective domains (robust optimization, Euclidean optimization, and learning minimax estimators) by leveraging optimization procedures that have access to appropriate black boxes.

Funding

BIGDATA: F: DKA: Collaborative Research: High-Dimensional Statistical Machine Learning for Spatio-Temporal Climate Data

Directorate for Computer & Information Science & Engineering


Collaborative Research: Physics-Based Machine Learning for Sub-Seasonal Climate Forecasting

Directorate for Computer & Information Science & Engineering


Domain-knowledge Hybridized Statistical Machine Learning (DHSML)

United States Department of the Navy


History

Date

2024-08-06

Degree Type

  • Dissertation

Thesis Department

  • Machine Learning

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Pradeep Ravikumar