Carnegie Mellon University

Towards Multilingual Vision-Language Models

Thesis, posted on 2022-12-16, authored by Po-yao Huang

With the explosive growth of user-generated multimodal data, learning multimodal representations has enabled many novel vision-language applications in recent years. Yet while around 6,500 languages are spoken worldwide, most vision-language models and their datasets are English-based. This constraint unfortunately prevents current models from benefiting the broader non-English community. It is therefore urgent, yet rewarding, to develop methods that generalize English-based vision-language models to non-English languages.

My thesis work makes progress on multiple fronts of this challenge by exploring the emerging trend of learning multilingual multimodal representations, which facilitate modeling and reasoning over heterogeneous content, including images, videos, and text in various languages.

In the first part of this thesis, I identify the limitations of existing English-image representation learning to pave the way toward generalized multilingual multimodal representation learning. While prior work mainly associates whole images with their corresponding English captions, I argue that such correspondence should be more fine-grained and even multilingual. The results show that learning attention-based and object-oriented multilingual multimodal representations effectively improves end tasks such as cross-modality search and multimodal machine translation.
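As a rough, minimal sketch of this idea (not the exact models in the thesis), the PyTorch snippet below grounds caption tokens in detected object regions via cross-modal attention and trains with a max-margin ranking loss over in-batch negatives, a common setup for cross-modality search. All module names, dimensions, and the margin value are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionGroundedEncoder(nn.Module):
    # Attend from caption tokens (any language) to object-region features,
    # then pool both sides into unit-norm vectors for cross-modality search.
    def __init__(self, dim=512):
        super().__init__()
        self.scale = dim ** -0.5

    def forward(self, tokens, regions):
        # tokens:  (B, T, D) caption token embeddings (multilingual)
        # regions: (B, R, D) object-region features (e.g., detector outputs)
        attn = torch.softmax(tokens @ regions.transpose(1, 2) * self.scale, dim=-1)
        grounded = attn @ regions                        # (B, T, D) region-grounded tokens
        cap = F.normalize(grounded.mean(dim=1), dim=-1)  # caption vector
        img = F.normalize(regions.mean(dim=1), dim=-1)   # image vector
        return cap, img

def hard_negative_ranking_loss(cap, img, margin=0.2):
    # Max-margin ranking against the hardest in-batch negative in each direction.
    sims = cap @ img.t()                             # (B, B) similarity matrix
    pos = sims.diag().unsqueeze(1)                   # matched pairs on the diagonal
    mask = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
    cost_c2i = (margin + sims - pos).clamp(min=0).masked_fill(mask, 0)
    cost_i2c = (margin + sims.t() - pos).clamp(min=0).masked_fill(mask, 0)
    return cost_c2i.max(dim=1).values.mean() + cost_i2c.max(dim=1).values.mean()

Because the caption pathway only consumes token embeddings, the same region-grounding mechanism applies unchanged to non-English captions given a multilingual embedding table, which is what makes this framing naturally multilingual.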

The second part of this thesis studies cross-lingual generalization of vision-language models. I address the scalability challenge in large-scale, task-agnostic multilingual multimodal pre-training and the lack-of-annotation challenge when fine-tuning on the end task. To learn from noisy, million-scale, uncurated instructional videos and their transcriptions in various languages, I analyze the desirable supporting-set size in multimodal self-supervised learning and propose a reconstruction objective to alleviate this bottleneck. Additionally, I explore multilingual multimodal pre-training and construct the Multi-HowTo100M dataset, a collection of 120M video clips and their transcriptions in 9 languages, to improve zero-shot cross-lingual transfer of vision-language models. Finally, in task-specific fine-tuning, I exploit automatically extracted visual semantics to learn with sparse English-vision annotations. When non-English annotations are scarce or unavailable, I investigate visual-pivoting supervised and unsupervised multimodal machine translation to translate English-vision data into non-English-vision data for multilingual multimodal fine-tuning.
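For the pre-training recipe, a minimal sketch of the general shape of such an objective (contrastive video-text alignment plus a supporting-set reconstruction regularizer) is given below. The exact objectives, the softmax temperature, the loss weight, and the reconstruction form are assumptions for illustration, not the thesis's specification.

import torch
import torch.nn.functional as F

def video_text_infonce(v, t, tau=0.07):
    # Symmetric InfoNCE over in-batch video/text pairs (standard formulation).
    v, t = F.normalize(v, dim=-1), F.normalize(t, dim=-1)
    logits = v @ t.t() / tau
    labels = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def support_set_reconstruction(v, support_texts):
    # Reconstruct each video embedding as an attention-weighted combination of
    # a supporting set of text embeddings; the size of that set controls how
    # much cross-sample sharing the objective encourages (assumed form).
    weights = torch.softmax(
        F.normalize(v, dim=-1) @ F.normalize(support_texts, dim=-1).t(), dim=-1)
    recon = weights @ support_texts          # (B, D) reconstructed video embeddings
    return F.mse_loss(recon, v)

# Combined objective; the 0.1 weighting is a placeholder, not a reported value.
# loss = video_text_infonce(v, t) + 0.1 * support_set_reconstruction(v, support_texts)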

The combined efforts in this thesis lead to notable breakthroughs in enhancing the cross-lingual generalization capabilities of vision-language models. I believe the proposed methodologies and the released resources will be a crucial step towards multilingual vision-language models.

Date: 2021-08-01
Degree Type: Dissertation
Department: Language Technologies Institute
Degree Name: Doctor of Philosophy (PhD)
Advisor(s): Alexander G. Hauptmann
