Towards Multi-Task Multi-Modal Models: A Video Generative Perspective

Yu, Lijun

doi:10.1184/R1/25901659.v1

thesis

posted on 2024-06-26, 18:38 authored by Lijun Yu

Advancements in language foundation models have primarily fueled the recent surge in artificial intelligence. In contrast, generative learning of non-textual modalities, especially videos, significantly trails behind language modeling. This thesis chronicles our endeavor to build multi-task models for generating videos and other modalities under diverse conditions, as well as for understanding and compression applications.

We start with two pixel-space prototypes for separate multi-task and multi-modal setups. Despite their effectiveness, these models are constrained by task-specific modules and predefined label spaces, underscoring the need for more universally applicable designs.

Given the high dimensionality of visual data, we pursue concise and accurate latent representations. Our video-native spatial-temporal tokenizers preserve high fidelity. We unveil a novel approach to mapping bidirectionally between visual observations and interpretable lexical terms. Furthermore, our scalable visual token representation proves beneficial across generation, compression, and understanding tasks. These results mark the first instances of a language model surpassing diffusion models in visual synthesis and of a video tokenizer outperforming industry-standard codecs.

Within these multi-modal latent spaces, we study the design of multi-task generative models. Our masked multi-task transformer excels at the quality, efficiency, and flexibility of video generation. We enable a frozen language model, trained solely on text, to generate visual content. Finally, we build a scalable generative multi-modal transformer trained from scratch, enabling the generation of videos containing high-fidelity motion with the corresponding audio given diverse conditions.

Throughout this work, we have shown the effectiveness of integrating multiple tasks, crafting high-fidelity latent representations, and generating multiple modalities. This work suggests intriguing potential for future exploration in generating non-textual data and enabling real-time, interactive experiences across various media forms.

History

Date

2024-04-29

Degree Type

Dissertation

Department

Language Technologies Institute

Degree Name

Doctor of Philosophy (PhD)

Advisor(s)

Alexander Hauptmann

Keywords

Multi-Modal; Multi-Task; Video Generation; Visual Tokenization; Generative Transformer; Foundation Models; Representation Learning; Visual Understanding; Information and Computing Sciences not elsewhere classified

Licence

CC BY 4.0
