Data Efficient Multilingual Natural Language Processing

Wang, Xinyi

doi:10.1184/R1/24943440.v1

xinyiw1_PhD_LTI_2022.pdf (4.28 MB)

thesis

posted on 2024-01-12, 21:10 authored by Xinyi Wang

The adoption of neural network models has led to state-of-the-art performance in many NLP tasks on major languages that have large amounts of data (Devlin et al., 2019; Vaswani et al., 2017), but improvements often lag behind for low-resource languages (Koehn and Knowles, 2017; Sennrich and Zhang, 2019a). This imbalance in NLP progress could lead to increasing disparities between people from different regions or in different socioeconomic conditions. The goal of this thesis is to develop methods that efficiently utilize the available data resources to build competitive NLP systems for all languages.

We focus on multilingual training, a particularly effective strategy for improving the model quality of low-resource languages while training parameter-efficient models (Zoph et al., 2016; Neubig and Hu, 2018; Devlin et al., 2019; Conneau et al., 2019). We identify three major challenges facing multilingual models. (1) The standard word embedding representation hinders the model’s generalization to training signals across different languages, mainly because it does not have good inductive biases to account for the lexical similarities and discrepancies between different languages. (2) Searching for good multilingual data selection and balancing strategies requires multiple runs of model retraining because multilingual datasets are often highly imbalanced across different languages. (3) It is challenging to adapt a multilingual model to languages with very limited resources. To tackle the first two challenges for multilingual training, we propose better word representation methods for multilingual data that encourage positive transfer between languages, and design automatic methods to select and balance multilingual training data. To tackle the third challenge, we explore novel methods that adapt multilingual models to support language varieties that are often overlooked in existing multilingual benchmarks through model ensembling and data augmentation.

History

Date

2022-07-09

Degree Type

Dissertation

Department

Language Technologies Institute

Degree Name

Doctor of Philosophy (PhD)

Advisor(s)

Graham Neubig

Keywords

natural language processing multilingual learning machine translation

Licence

CC BY 4.0
