Provable, Structured, and Efficient Methods for Robustness of Deep Networks to Adversarial Examples
thesisposted on 25.01.2021, 22:31 by Eric Wong
While deep networks have contributed to major leaps in raw performance across various applications, they are also known to be quite brittle to targeted data perturbations.
By adding a small amount of adversarial noise to the data, it is possible to drastically change the output of a deep network. The existence of these so-called adversarial examples, perturbed data points which fool the model, pose a serious risk for safety- and security-centric applications where reliability and robustness are critical. In this dissertation, we present and analyze a number of approaches for mitigating the effect of adversarial examples, also known as adversarial defenses. These defenses can offer varying degrees and types of robustness, and in this dissertation we study defenses which differ in the strength of the the robustness guarantee, the efficiency and simplicity of the defense, and the type of perturbation being defended
against. We start with the strongest type of guarantee called provable adversarial defenses, showing that is possible to compute duality-based certificates that guarantee no adversarial examples exist within an `p-bounded region, which are trainable and can be minimized to learn networks which are provably robust to adversarial attacks. The approach is agnostic to the specific architecture and is applicable to arbitrary computational graphs, scaling to medium sized convolutional networks with random projections. We then switch gears to developing a deeper understanding of a more empirical defense known as adversarial training. Although adversarial training does not come with formal guarantees, it can learn networks more efficiently and with better empirical performance against attacks. We study the optimization process and reveal
several intriguing properties of the robust learning problem, finding that a simple modification to one of the earliest adversarial attacks can be sufficient to learn networks
robust to much stronger attacks, as well as finding that adversarial training as a general procedure is highly susceptible to overfitting. These discoveries have significant
implications on both the efficiency of adversarial training as well as the state of the field: for example, virtually all recent algorithmic improvements in adversarial training can be matched by simply using early stopping. The final component of this dissertation expands the realm of adversarial examples beyond `p-norm bounded perturbations, to enable more realistic threat models
for applications beyond imperceptible noise. We define a threat model called the Wasserstein adversarial example, which captures semantically meaningful image
transformations like translations and rotations previously uncaptured by existing threat models. We present an efficient algorithm for projecting onto Wasserstein
balls, enabling both generation of and adversarial training against Wasserstein adversarial examples. Finally, we demonstrate how to generalize adversarial training
to defend against multiple types of threats simultaneously, improving upon naive aggregations of adversarial attacks.
- Doctor of Philosophy (PhD)