Carnegie Mellon University

Robust Multimodal Learning from Language, Visual and Acoustic Modalities

Thesis posted on 2025-04-24, 19:37, authored by Amirali Bagher Zadeh

As we build new AI technologies that can interact with the real world around them, the problem of learning from multiple modalities takes center stage. In applications ranging from healthcare to education to communication, drawing on multiple modalities has proven to be a key factor in perceiving and processing the world around us more accurately. In this thesis, we focus on the problem of learning multimodal representations in the real world. We outline three main challenges in multimodal machine learning and take concrete steps to address them. First, we tackle the challenge of local fusion, where the focus is on learning cross-modal dynamics, including the unimodal, bimodal and trimodal interactions between the language, visual and acoustic modalities (the three modalities most commonly present around us). Second, we move to temporal fusion, in which the challenges of local fusion extend to the temporal domain; temporal fusion requires alignment between modalities, which is as vital as learning cross-modal dynamics. Third, we address the fact that multimodal data is almost always only partially observed in the real world, and we extend Variational Inference (VI) to handle even the most extreme missing rates and missing patterns. In addressing these challenges, which are studied in depth in this thesis, we make algorithmic, theoretical and empirical contributions to multimodal machine learning.
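The abstract does not spell out the fusion architecture itself, but one common way to expose unimodal, bimodal and trimodal interactions in a single representation is an outer product over modality embeddings, each padded with a constant 1 so the lower-order terms are retained. The sketch below is illustrative only, not the thesis's exact method; the embedding vectors z_language, z_visual, z_acoustic and their dimensions are hypothetical.

```python
# Illustrative sketch of cross-modal fusion via a triple outer product.
# Not the thesis's specific architecture; a generic way to materialize
# unimodal, bimodal and trimodal interaction terms in one tensor.
import numpy as np

def multimodal_tensor(z_language, z_visual, z_acoustic):
    """Return a (dl+1) x (dv+1) x (da+1) interaction tensor."""
    zl = np.concatenate([z_language, [1.0]])  # appended 1 keeps unimodal terms
    zv = np.concatenate([z_visual, [1.0]])
    za = np.concatenate([z_acoustic, [1.0]])
    # Entry (i, j, k) = zl[i] * zv[j] * za[k], so the tensor contains
    # unimodal, bimodal and trimodal products of the three embeddings.
    return np.einsum('i,j,k->ijk', zl, zv, za)

# Toy usage with random embeddings of different sizes per modality.
rng = np.random.default_rng(0)
M = multimodal_tensor(rng.standard_normal(4),   # language embedding
                      rng.standard_normal(3),   # visual embedding
                      rng.standard_normal(2))   # acoustic embedding
print(M.shape)  # (5, 4, 3)
```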

History

Date

2021-08-21

Degree Type

  • Dissertation

Department

  • Language Technologies Institute

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Louis-Philippe Morency
