Carnegie Mellon University
xinyiw1_PhD_LTI_2022.pdf (4.28 MB)

Data Efficient Multilingual Natural Language Processing

posted on 2024-01-12, 21:10 authored by Xinyi Wang

The adoption of neural network models has led to state-of-the-art performance in many NLP tasks on major languages that have large amounts of data (Devlin et al., 2019; Vaswani et al., 2017), but their improvements often lag behind for low-resource languages (Koehn and Knowles, 2017; Sennrich and Zhang, 2019a). This imbalance in NLP progress could lead to increasing disparities between people from different regions or in different socioeconomic conditions. The goal of this thesis is to develop methods that efficiently utilize the available data resources to build competitive NLP systems for all languages.

We focus on multilingual training, a particularly effective strategy for improving the model quality of low-resource languages while training parameter-efficient models (Zoph et al., 2016; Neubig and Hu, 2018; Devlin et al., 2019; Conneau et al., 2019). We identify three major challenges facing multilingual models. (1) The standard word embedding representation hinders the model’s generalization to training signals across different languages, mainly because it does not have good inductive biases to account for the lexical similarities and discrepancies between different languages. (2) Searching for good multilingual data selection and balancing strategies requires multiple runs of model retraining because multilingual datasets are often highly imbalanced across different languages. (3) It is challenging to adapt a multilingual model to languages with very limited resources. To tackle the first two challenges for multilingual training, we propose better word representation methods for multilingual data that encourage positive transfer between languages, and design automatic methods to select and balance multilingual training data. To tackle the third challenge, we explore novel methods that adapt multilingual models to support language varieties that are often overlooked in existing multilingual benchmarks through model ensembling and data augmentation.




Degree Type

  • Dissertation

Department


  • Language Technologies Institute

Degree Name

  • Doctor of Philosophy (PhD)


Advisor(s)

  • Graham Neubig
