Carnegie Mellon University

Foundations of Multisensory Artificial Intelligence

thesis
posted on 2024-07-22, 19:24, authored by Pu Liang

 Building multisensory artificial intelligence systems that learn from multiple sensory inputs such as text, speech, video, real-world sensors, wearable devices, and medical data holds great promise for impact in many scientific areas with practical benefits, such as in supporting human health and well-being, enabling multimedia content processing, and enhancing real-world autonomous agents. 

However, the breadth of progress in multimodal research has made it difficult to identify the common themes and open questions in the field. By synthesizing a range of theoretical frameworks and application domains, this thesis aims to advance the foundations of multimodal machine learning. We start by defining three key principles often present in multimodal problems: modality heterogeneity, connections, and interactions [371]. Using these principles as a foundation, we propose a taxonomy of six core challenges in multimodal research: representation, alignment, reasoning, generation, transference, and quantification. Recent technical achievements will be presented through this taxonomy, allowing researchers to understand the similarities and differences across approaches and to identify open problems for future research.

The bulk of the thesis covers our recent progress towards tackling two key problems in multimodal learning: the machine learning foundations of multimodal interactions, and practical methods for building multisensory foundation models that generalize to many modalities and tasks in the real world.

In the first part, we study the foundations of multimodal interactions: the basic principle of how modalities combine to give rise to new information for a task. We present a theoretical framework formalizing these interactions, such as the sarcasm identified from the incongruity between spoken words and vocal expressions [372]. Using this framework, we propose two practical estimators to quantify the interactions in real-world datasets. Quantifying the types of interactions a multimodal task requires enables researchers to decide which modalities to collect [376], design suitable approaches to learn these interactions [374], and analyze whether their model has succeeded in learning them [375].
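To make the notion of interactions concrete, one standard formalization (a hedged illustration here, not necessarily the exact estimators of [372]) is partial information decomposition, which splits the total information two modalities X_1 and X_2 carry about a task Y into redundant, unique, and synergistic parts:

    I(X_1, X_2; Y) = R + U_1 + U_2 + S

where R is information redundantly available from either modality alone, U_1 and U_2 are information unique to each modality, and S is synergy that emerges only when the modalities are combined; sarcasm is a canonical synergistic example, since neither the words nor the tone alone reveal it.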

In the second part, we study the design of practical multimodal foundation models that generalize across many modalities and tasks, a step toward grounding large language models in real-world sensory modalities. We first introduce MULTIBENCH, a unified large-scale benchmark spanning a wide range of modalities, tasks, and research areas [367]. We will also present the cross-modal attention [101, 359] and multimodal transformer [613] architectures that now underpin many of today's multimodal foundation models. Scaling these architectures on MULTIBENCH enables the creation of general-purpose multimodal multitask models, and we have collaborated broadly with practitioners to apply these models for real-world impact in affective computing, mental health, and cancer prognosis.
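To illustrate the core mechanism, the sketch below shows cross-modal attention in PyTorch: queries come from one modality and keys and values from another, so each position in the first modality attends over the second. This is a minimal, hypothetical sketch of the general idea, not the exact architectures of [101, 359, 613]; the class name, dimensions, and shapes are illustrative assumptions.

    # Minimal cross-modal attention sketch (an assumption for illustration,
    # not the exact thesis architecture): one modality supplies the queries,
    # the other supplies the keys and values.
    import torch
    import torch.nn as nn

    class CrossModalAttention(nn.Module):
        def __init__(self, dim: int, num_heads: int = 4):
            super().__init__()
            # nn.MultiheadAttention already supports distinct query/key/value
            # inputs, which is all cross-modal attention needs.
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        def forward(self, query_mod, context_mod):
            # query_mod:   (batch, len_q, dim), e.g. text token embeddings
            # context_mod: (batch, len_k, dim), e.g. audio frame embeddings
            out, _ = self.attn(query_mod, context_mod, context_mod)
            return out

    # Hypothetical usage: 10 text tokens attend over 50 audio frames.
    text = torch.randn(2, 10, 64)
    audio = torch.randn(2, 50, 64)
    fused = CrossModalAttention(dim=64)(text, audio)  # shape (2, 10, 64)

Stacking such layers in both directions and interleaving them with self-attention is the basic recipe behind multimodal transformer variants.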

We conclude this thesis by discussing how future work can leverage these ideas toward more general, interactive, and safe multimodal artificial intelligence.  

Funding

SCH: INT: Collaborative Research: Dyadic Behavior Informatics for Psychotherapy Process and Outcome

Directorate for Computer & Information Science & Engineering


CAREER: Learning Nonverbal Signatures

Directorate for Computer & Information Science & Engineering


Automatic Multimodal Affect Detection for Research and Clinical Use

National Institute of Mental Health


MAPS: Mobile Assessment for the Prediction of Suicide

National Institute of Mental Health


History

Date

2024-04-15

Degree Type

  • Dissertation

Department

  • Machine Learning

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

  • Louis-Philippe Morency
  • Ruslan Salakhutdinov