Carnegie Mellon University

Learning Generalizable Visual Representations Towards Novel Viewpoints, Scenes and Vocabularies

thesis
posted on 2025-02-07, 20:00 authored by Xiaoyu Zhu

Deep learning has made significant progress in analyzing the unprecedented amount of rich visual information in the real world, enabling applications such as robotics, surveillance, and public safety monitoring. The successful deployment of deep learning techniques relies heavily on the availability of large-scale, domain-specific annotated data. However, this requirement is unlikely to be met in many real-world scenarios. In practice, various domain gaps exist between training and test data: test data are typically drawn from out-of-domain distributions, encompassing novel viewpoints, varied noise conditions, and diverse scenes. Beyond this diversity in visual appearance, deep learning models trained on fixed, closed-set labels may not satisfy queries expressed as arbitrary text prompts from users, and novel vocabularies may not be accessible during training. To enable the deployment of a robust visual perception system, learning generalizable feature representations during training is crucial.

In this thesis, with the goal of developing systems that can generalize to novel viewpoints, scenes, and vocabularies, we explore representation learning methods based on Siamese learning, masked visual modeling, and generative pretraining. The thesis consists of three parts. The first part addresses robust semantic instance segmentation for videos and 3D data. We aim to learn feature representations that are invariant to varying viewpoints and noise conditions via Siamese learning, leveraging temporal consistency for videos and spatial consistency for 3D volumetric images so that the learned representations generalize well. In the second part, we tackle human action analysis, which requires the model to learn from dynamic cues. We propose representation learning techniques based on masked visual modeling so that the model captures richer spatial-temporal context, and we exploit both RGB videos and 3D human meshes for robust multi-modal action analysis. Finally, in the third part, we leverage generatively pre-trained vision-language models and develop systems that can handle novel vocabularies and text prompts. Our ultimate goal is to build a robust system that generalizes to novel viewpoints, scenes, and vocabularies.
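To give a concrete sense of the Siamese consistency idea mentioned above, the following is a minimal, purely illustrative PyTorch sketch rather than the thesis's actual model or code: a single shared encoder embeds two views of the same instance (for example, two temporally adjacent video frames), and a cosine-similarity loss pulls the two feature vectors together. All layer sizes, module names, and input shapes here are placeholder assumptions.

    # Illustrative sketch (not the thesis's actual architecture): a Siamese
    # consistency objective that encourages a shared encoder to produce similar
    # features for two views of the same instance, e.g. two frames of a video
    # or two noisy renderings of a 3D volume. Sizes are arbitrary placeholders.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SiameseConsistency(nn.Module):
        def __init__(self, feature_dim: int = 128):
            super().__init__()
            # Toy encoder; a real system would use a video or 3D backbone.
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
                nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
                nn.Flatten(),
                nn.Linear(32, feature_dim),
            )

        def forward(self, view_a: torch.Tensor, view_b: torch.Tensor) -> torch.Tensor:
            # Shared weights: the same encoder processes both views.
            z_a = F.normalize(self.encoder(view_a), dim=-1)
            z_b = F.normalize(self.encoder(view_b), dim=-1)
            # Consistency loss: maximize cosine similarity between the two views.
            return -(z_a * z_b).sum(dim=-1).mean()

    if __name__ == "__main__":
        model = SiameseConsistency()
        frame_t = torch.randn(4, 3, 64, 64)        # e.g. a video frame
        frame_t_next = torch.randn(4, 3, 64, 64)   # e.g. a temporally adjacent frame
        loss = model(frame_t, frame_t_next)
        print(loss.item())

Minimizing this loss drives the encoder toward features that are stable across time (or across spatial perturbations in the 3D case), which is the intuition behind using consistency as a supervisory signal for viewpoint- and noise-invariant representations.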

History

Date

2024-12-06

Degree Type

  • Dissertation

Department

  • Language Technologies Institute

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Alexander Hauptmann
