Carnegie Mellon University

On the Feature Alignment of Deep Vision Models: Explainability and Robustness Connected At Hip

Thesis, posted on 2023-09-07, 20:55, authored by Zifan Wang

Deep Neural Networks (DNNs) have recently demonstrated performance comparable to that of humans. However, it remains difficult to determine whether these models' behaviors, ethical values, and morality always align with human interests, an issue known as the (mis)alignment of intelligent systems. One basic requirement for a deep classifier to be considered aligned is that its output is always semantically equivalent to that of a human who possesses the necessary knowledge and tools to solve the problem at hand. Unfortunately, verifying output alignment between models and humans is often infeasible, as it would be impractical to test every sample from the distribution.

The lack of output alignment in DNNs is evidenced by their vulnerability to adversarial noise: perturbations that are unlikely to affect a human's response. This weakness originates from the fact that features important to the model may not be semantically meaningful from a human perspective, an issue we term feature (mis)alignment in vision tasks. Being (perceptually) aligned with humans on useful features is necessary to preserve output alignment. The goal of this thesis is therefore to evaluate and enhance the feature alignment of deep vision classifiers in order to promote output alignment.
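
To make the notion of adversarial noise concrete, here is a minimal sketch of a one-step gradient attack (FGSM, Goodfellow et al.) in PyTorch; the model, loss, and step size are illustrative assumptions, not the thesis's experimental setup:

    import torch
    import torch.nn.functional as F

    def fgsm_perturb(model, x, y, eps=0.03):
        # One-step Fast Gradient Sign Method: perturb the input in the
        # direction that increases the classification loss, with the
        # perturbation bounded by eps in the l-infinity norm.
        # (eps=0.03 is an illustrative choice.)
        x = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        return (x + eps * x.grad.sign()).detach()

A perturbation of this kind is typically imperceptible to a human yet can flip the model's prediction, which is exactly the output misalignment described above.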

To evaluate feature alignment, we introduce locality, a metric built on explainability tools that faithfully identify the features contributing to a model's output. The first contribution of this thesis shows that modern architectures, e.g., Vision Transformers (ViTs), the state-of-the-art classifiers on many tasks, are misaligned in their features. Our second contribution shows that improved adversarial robustness leads to improved locality: we find that a robust model has better locality than any non-robust model, and that a model's locality increases as it becomes more robust. Inspired by this finding, our third contribution improves robustness with a novel technique, TrH regularization, based on directly minimizing a PAC-Bayesian generalization bound for robustness. This technique achieves new state-of-the-art robustness for ViTs. However, because robustness is typically measured by running existing attacks, the resulting guarantee is only empirical and may fail against adaptive attacks. The final contribution of this thesis introduces GloRo Nets, which include a built-in formal robustness verification layer based on the global Lipschitz constant of the model. Unlike the probabilistic guarantee provided by Randomized Smoothing, GloRo Nets offer a deterministic guarantee and significantly improve the state-of-the-art provable robustness under ℓ2-norm-bounded threats.
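
To illustrate the kind of deterministic guarantee Lipschitz-based verification provides, here is a minimal sketch of an ℓ2 certification check; lip_const is assumed to be a sound upper bound on the Lipschitz constant of each logit, and the actual GloRo layer uses tighter per-class constants folded into an extra output logit:

    import torch

    def certify_l2(logits, lip_const, eps):
        # If every logit moves by at most lip_const * eps when the
        # input moves by at most eps in l2 norm, the predicted class
        # cannot change as long as the top-two logit margin exceeds
        # 2 * lip_const * eps.
        top2 = logits.topk(2, dim=-1).values
        margin = top2[..., 0] - top2[..., 1]
        return margin > 2.0 * lip_const * eps

Because the bound holds for every input in the ε-ball, this check either certifies a prediction outright or abstains; no sampling is involved, in contrast to the probabilistic certificates of Randomized Smoothing.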

Robustness is necessary for feature alignment but is likely not sufficient, as there are other, unspecified requirements whose violation would also result in misalignment. In conclusion, the thesis discusses the issue of under-specification in classification and its connection to alignment, together with potential remedies, as another step towards feature alignment in deep learning.

History

Date

2023-08-15

Degree Type

  • Dissertation

Department

  • Electrical and Computer Engineering

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Anupam Datta, Matt Fredrikson
