Carnegie Mellon University
sachink_PhD_LTI_2023.pdf (3.02 MB)

Optimization Methods for Improving Diversity in Language Technologies

posted on 2023-12-13, 21:24, authored by Sachin Kumar

Language use varies across individuals, communities, and populations, giving rise to language varieties with diverging vocabularies, syntax, semantics, and pragmatics. Despite rapid improvements in natural language processing (NLP) systems on standard benchmarks in several languages, these systems often fail to represent this diversity. In this thesis, I aim to develop methods that enable NLP systems to understand and generate natural language while explicitly modeling the extra-linguistic variables associated with diverse language use.

This thesis reformulates conventional training and inference problems in neural network-based NLP models as instances of multi-objective optimization, and is divided into two parts. In the first part, I present (a) a method for training robust text classification models that demotes reliance on spurious correlations in data, with applications to detecting language varieties as well as other tasks where patterns of variation are confounds; and (b) a prompting framework that contextualizes text classifiers for pragmatic tasks to different domains and to social and personal factors of variation. In the second part, I focus on enriching diversity in text generation. I present (c) a training algorithm for machine translation that separates token representation learning from model learning, resulting in improved lexical diversity in the generated text. I show that this approach adapts easily to generating closely related dialects of the target language. Finally, I present (d) decoding algorithms for controlling stylistic variation in text generated from pretrained language models. I frame controlled decoding as constrained optimization and develop gradient-based methods that generate text non-autoregressively, initializing the entire output sequence and updating it iteratively. I validate these approaches with different types of controls on machine translation, style transfer, and open-ended generation. Overall, this thesis aims to advance research directions in NLP beyond standardized language towards societal use, where research questions and methodology are guided by relevant training and inference objectives.




Degree Type

  • Dissertation


Department

  • Language Technologies Institute

Degree Name

  • Doctor of Philosophy (PhD)


Advisor(s)

  • Yulia Tsvetkov
