Optimization Methods for Improving Diversity in Language Technologies

Kumar, Sachin

doi:10.1184/R1/24579058.v1

sachink_PhD_LTI_2023.pdf (3.02 MB)

thesis

posted on 2023-12-13, 21:24, authored by Sachin Kumar

Language use varies across individuals, communities, and populations, giving rise to varieties with diverging vocabularies, syntax, semantics, and pragmatics. Despite rapid improvements in natural language processing (NLP) systems on standard benchmarks in several languages, these models often fail to represent this diversity. In this thesis, I aim to develop methods that enable NLP systems to understand and generate natural language while explicitly modeling the extra-linguistic variables associated with diverse language use.

This thesis reformulates conventional training and inference problems in neural network-based NLP models as instances of multi-objective optimization and is divided into two parts. In the first part, I present (a) a method to train robust text classification models that demotes reliance on spurious correlations in the data, with applications to detecting language varieties as well as other tasks where patterns of variation are confounds, and (b) a prompting framework that contextualizes text classifiers for pragmatic tasks to different domains and to social and personal factors of variation. In the second part, I focus on enriching diversity in text generation. I present (c) a training algorithm for machine translation that separates token representation learning from model learning, resulting in improved lexical diversity in the generated text. I show that this approach adapts easily to generating closely related dialects of the target language. Finally, I present (d) decoding algorithms to control for stylistic variation in text generated from pretrained language models. I frame controlled decoding as constrained optimization and develop gradient-based methods that generate text non-autoregressively by initializing the entire output sequence and updating it iteratively. I validate these approaches with different types of controls on machine translation, style transfer, and open-ended generation. Overall, this thesis aims to advance NLP research beyond standardized language towards societal use, where research questions and methodology are guided by relevant training and inference objectives.

History

Date

2023-09-14

Degree Type

Dissertation

Department

Language Technologies Institute

Degree Name

Doctor of Philosophy (PhD)

Advisor(s)

Yulia Tsvetkov

Keywords

natural language processing, machine translation, lexical diversity

Licence

CC BY 4.0
