Robust Multimodal Learning from Language, Visual and Acoustic Modalities
As we build AI technologies that interact with the real world around them, the problem of learning from multiple modalities takes center stage. In applications ranging from healthcare to education to communication, relying on multiple modalities has proven key to perceiving and processing the world around us more accurately. In this thesis, we focus on the problem of learning multimodal representations in the real world. We outline three main challenges in multimodal machine learning and take concrete steps to address them. First, we tackle the challenge of local fusion, where the focus is on learning cross-modal dynamics, including the unimodal, bimodal and trimodal interactions between the language, visual and acoustic modalities (the three modalities most commonly present around us). Next, we move to temporal fusion, in which the challenges of local fusion extend to the temporal domain. Temporal fusion requires alignment between modalities, which is as vital as learning the cross-modal dynamics. Finally, the third challenge deals with the fact that multimodal data is almost always only partially observable in the real world. We extend the capabilities of Variational Inference (VI) to deal with even the most extreme missing rates and missing patterns. In addressing these challenges, which are studied in depth in this thesis, we make algorithmic, theoretical and empirical contributions to multimodal machine learning.
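To make the local fusion challenge concrete, one common way to capture unimodal, bimodal and trimodal interactions is to fuse the three modality embeddings through an outer product. The sketch below is illustrative only, assuming arbitrary embedding sizes, the hypothetical class name OuterProductFusion, and PyTorch; it is not the thesis's actual model.

```python
# Illustrative sketch of local (non-temporal) trimodal fusion via outer products.
# Dimensions, layer sizes, and class name are assumptions, not the thesis's exact model.
import torch
import torch.nn as nn


class OuterProductFusion(nn.Module):
    """Fuses language, visual and acoustic embeddings by taking the outer
    product of the three vectors (each padded with a constant 1), so the
    fused tensor contains unimodal, bimodal and trimodal interaction terms."""

    def __init__(self, d_lang: int, d_vis: int, d_aco: int, d_out: int):
        super().__init__()
        fused_dim = (d_lang + 1) * (d_vis + 1) * (d_aco + 1)
        self.classifier = nn.Linear(fused_dim, d_out)

    def forward(self, z_l, z_v, z_a):
        # Append a constant 1 to each embedding so lower-order (unimodal and
        # bimodal) interactions survive the triple outer product.
        pad = lambda z: torch.cat([z, torch.ones(z.size(0), 1)], dim=1)
        z_l, z_v, z_a = pad(z_l), pad(z_v), pad(z_a)
        # Batched outer product: shape (B, d_l+1, d_v+1, d_a+1), flattened per sample.
        fused = torch.einsum("bi,bj,bk->bijk", z_l, z_v, z_a)
        return self.classifier(fused.flatten(start_dim=1))


# Toy usage with random embeddings (batch of 4).
model = OuterProductFusion(d_lang=8, d_vis=6, d_aco=4, d_out=2)
out = model(torch.randn(4, 8), torch.randn(4, 6), torch.randn(4, 4))
print(out.shape)  # torch.Size([4, 2])
```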
History
Date
- 2021-08-21
Degree Type
- Dissertation
Department
- Language Technologies Institute
Degree Name
- Doctor of Philosophy (PhD)