Learning Generalizable Visual Representations Towards Novel Viewpoints, Scenes and Vocabularies
Deep learning has made significant progress in analyzing the unprecedented amount of rich visual information from the real world, enabling applications such as robotics, surveillance, and public safety monitoring. The successful deployment of deep learning techniques relies heavily on the availability of large-scale, domain-specific annotated data. However, this requirement is unlikely to be met in many real-world scenarios. In practice, various domain gaps exist between the training and test data: test data are typically drawn from out-of-domain distributions, encompassing novel viewpoints, varied noise conditions, and diverse scenes. Beyond this diversity in visual appearance, deep learning models trained on a fixed, closed set of labels may not satisfy queries expressed as arbitrary text prompts from users, and novel vocabularies may not be accessible during training. To deploy a robust visual perception system, it is therefore crucial to learn generalized feature representations during training.
In this thesis, with the goal of developing systems that can generalize to novel viewpoints, scenes, and vocabularies, we explore representation learning methods based on Siamese learning, masked visual modeling, and generative pre-training. The thesis consists of three parts. The first part addresses robust semantic instance segmentation for videos and 3D data. We aim to learn feature representations that are invariant to varying viewpoints and noise conditions via Siamese learning, leveraging temporal consistency for videos and spatial consistency for 3D volumetric images so that the learned representations generalize well. In the second part, we tackle human action analysis, which requires the model to learn from dynamic cues. We propose representation learning techniques based on masked visual modeling so that the model captures richer spatial-temporal context, and we exploit both RGB videos and 3D human meshes for robust multi-modal action analysis. Finally, in the third part, we leverage generatively pre-trained vision-language models and develop systems that can handle novel vocabularies and text prompts. Our final goal is to build a robust system that can generalize to novel viewpoints, scenes, and vocabularies.
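To make the Siamese-consistency idea in the first part concrete, the sketch below shows one minimal form such an objective could take. It is a sketch under stated assumptions, not the thesis's actual implementation: the PyTorch encoder, the function name siamese_consistency_loss, and the tensor shapes are illustrative placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical tiny encoder standing in for the segmentation backbone.
encoder = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 64),
)

def siamese_consistency_loss(view_a: torch.Tensor, view_b: torch.Tensor) -> torch.Tensor:
    """Pull embeddings of two views of the same content together.

    The two views could be temporally adjacent video frames or two
    augmented crops of the same 3D volume.
    """
    z_a = F.normalize(encoder(view_a), dim=-1)
    z_b = F.normalize(encoder(view_b), dim=-1)
    # Negative cosine similarity: minimizing it encourages the encoder
    # to produce viewpoint- and noise-invariant features.
    return -(z_a * z_b).sum(dim=-1).mean()

# Usage with random stand-in frames of shape (batch, channels, H, W).
frame_t = torch.randn(4, 3, 64, 64)
frame_t_next = frame_t + 0.05 * torch.randn_like(frame_t)  # simulated adjacent frame
loss = siamese_consistency_loss(frame_t, frame_t_next)
loss.backward()
```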
History
Date
- 2024-12-06
Degree Type
- Dissertation
Department
- Language Technologies Institute
Degree Name
- Doctor of Philosophy (PhD)