Carnegie Mellon University

Robust Multimodal Learning from Language, Visual and Acoustic Modalities

Thesis posted on 2025-04-24, 19:37, authored by Amirali Bagher Zadeh

As we build new AI technologies that can interact with the real world around them, the problem of learning from multiple modalities takes center stage. In applications ranging from healthcare to education to communication, drawing on multiple modalities has proven to be a key factor in perceiving and processing the world around us more accurately. In this thesis, we focus on the problem of learning multimodal representations in the real world. We outline three main challenges in multimodal machine learning and take concrete steps to address them. First, we tackle the challenge of local fusion, where the focus is on learning cross-modal dynamics, including the unimodal, bimodal and trimodal interactions between the language, visual and acoustic modalities (the three modalities most commonly present around us). Second, we move to temporal fusion, in which the challenges of local fusion extend to the temporal domain; temporal fusion requires alignment between modalities, which is as vital as learning cross-modal dynamics. Third, we address the fact that multimodal data is almost always only partially observed in the real world, and we extend Variational Inference (VI) to handle even the most extreme missing rates and missing patterns. In addressing these challenges, which are studied in depth in this thesis, we make algorithmic, theoretical and empirical contributions to multimodal machine learning.
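The abstract does not spell out the fusion architecture itself, but one common way to expose unimodal, bimodal and trimodal interactions in a single representation is an outer product over modality embeddings, each padded with a constant 1 so the lower-order terms are retained. The sketch below is illustrative only, not the thesis's exact method; the embedding vectors z_language, z_visual, z_acoustic and their dimensions are hypothetical.

```python
# Illustrative sketch of cross-modal fusion via a triple outer product.
# Not the thesis's specific architecture; a generic way to materialize
# unimodal, bimodal and trimodal interaction terms in one tensor.
import numpy as np

def multimodal_tensor(z_language, z_visual, z_acoustic):
    """Return a (dl+1) x (dv+1) x (da+1) interaction tensor."""
    zl = np.concatenate([z_language, [1.0]])  # appended 1 keeps unimodal terms
    zv = np.concatenate([z_visual, [1.0]])
    za = np.concatenate([z_acoustic, [1.0]])
    # Entry (i, j, k) = zl[i] * zv[j] * za[k], so the tensor contains
    # unimodal, bimodal and trimodal products of the three embeddings.
    return np.einsum('i,j,k->ijk', zl, zv, za)

# Toy usage with random embeddings of different sizes per modality.
rng = np.random.default_rng(0)
M = multimodal_tensor(rng.standard_normal(4),   # language embedding
                      rng.standard_normal(3),   # visual embedding
                      rng.standard_normal(2))   # acoustic embedding
print(M.shape)  # (5, 4, 3)
```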

History

Date

2021-08-21

Degree Type

  • Dissertation

Department

  • Language Technologies Institute

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Louis-Philippe Morency
