Leveraging Word and Phrase Alignments for Multilingual Learning

Hu, Junjie

doi:10.1184/R1/21707984.v1

thesis

posted on 2022-12-16, 20:53 authored by Junjie Hu

Recent years have witnessed impressive success in natural language processing (NLP) thanks to the advances of neural networks and the availability of large amounts of labeled data. However, many NLP systems have predominantly focused on high-resource languages (e.g., English, Chinese) that have large, computationally accessible collections of labeled data for training. While the achievements on high-resource languages are exciting, there are more than 6,900 languages in the world and the majority of them have far fewer resources for training deep neural networks. In fact, it is often expensive, or sometimes infeasible, to collect labeled data written in all possible languages. As a result, this data scarcity issue limits the generalization of NLP systems in many multilingual scenarios. Moreover, as models may be used to process text from a wide range of domains (e.g., social media or medical articles), the data scarcity issue is further exacerbated by the domain shift between the training and test data.

In this thesis, with the goal of improving the generalization ability of NLP models to alleviate the aforementioned challenges, we exploit word and phrase alignment to train neural NLP models (e.g., neural machine translation or contextualized language models), and provide evaluation methods for examining the generalization capabilities of such models over diverse application scenarios. This thesis contains two parts. The first part explores cross-lingual generalization for language understanding. In particular, we examine the ability of pre-trained multilingual representations to transfer learned knowledge from a high-resource language to other languages. To this end, we first introduce a multi-task benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks. Second, we leverage word and sentence alignments from parallel data to improve the multilingual representations for language understanding tasks such as those included in our benchmark. The second part of the thesis is devoted to leveraging alignment information for machine translation, a popular and useful language generation task. In particular, we focus on learning to translate aligned words and phrases between two languages with fewer parallel sentences. To accomplish this goal, we exploit techniques to obtain aligned words and phrases from monolingual data, knowledge bases or crowdsourcing and use them to improve translation systems.

History

Date

2021-08-13

Degree Type

Dissertation

Department

Language Technologies Institute

Degree Name

Doctor of Philosophy (PhD)

Advisor(s)

Dr. Graham Neubig

Keywords

natural language processing, multilingual learning, cross-lingual transfer learning, machine translation, domain adaptation, deep learning

Licence

In Copyright
