Carnegie Mellon University
Leveraging Word and Phrase Alignments for Multilingual Learning

thesis
posted on 2022-12-16, 20:53 authored by Junjie Hu

Recent years have witnessed impressive success in natural language processing (NLP) thanks to advances in neural networks and the availability of large amounts of labeled data. However, many NLP systems have focused predominantly on high-resource languages (e.g., English, Chinese) that have large, computationally accessible collections of labeled data for training. While the achievements on high-resource languages are exciting, there are more than 6,900 languages in the world, and the majority of them have far fewer resources for training deep neural networks. In fact, it is often expensive, and sometimes infeasible, to collect labeled data in every language. As a result, this data scarcity limits the generalization of NLP systems in many multilingual scenarios. Moreover, because models may be used to process text from a wide range of domains (e.g., social media or medical articles), the data scarcity issue is further exacerbated by the domain shift between training and test data.

In this thesis, with the goal of improving the generalization ability of NLP models to alleviate the aforementioned challenges, we exploit word and phrase alignment to train neural NLP models (e.g., neural machine translation or contextualized language models), and provide evaluation methods for examining the generalization capabilities of such models over diverse application scenarios. This thesis contains two parts. The first part explores cross-lingual generalization for language understanding. In particular, we examine the ability of pre-trained multilingual representations to transfer learned knowledge from a high-resource language to other languages. To this end, we first introduce a multi-task benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks. Second, we leverage word and sentence alignments from parallel data to improve the multilingual representations for language understanding tasks such as those included in our benchmark. The second part of the thesis is devoted to leveraging alignment information for machine translation, a popular and useful language generation task. In particular, we focus on learning to translate aligned words and phrases between two languages with fewer parallel sentences. To accomplish this goal, we exploit techniques to obtain aligned words and phrases from monolingual data, knowledge bases or crowdsourcing and use them to improve translation systems. 

History

Date

2021-08-13

Degree Type

  • Dissertation

Department

  • Language Technologies Institute

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Dr. Graham Neubig
