Carnegie Mellon University
asrivats_phd_lti_2024.pdf (10.02 MB)

Leveraging Structure and Context for Language-Adjacent Representation Learning

posted on 2024-05-03, 15:21 authored by Nikita Srivatsan

When learning representations from large corpora of language data, the overwhelmingly common strategy is to interpret that data as a collection of IID samples to be modeled in isolation from one another. This approach has real benefits: it allows efficient training via SGD and doesn't rely on metadata that may not always be present. It does, however, come with limitations. Taking advantage of more complex structural links between individual datapoints can let information flow within our corpora. This makes learned representations more context-sensitive and allows heavier parameter sharing, which helps models generalize to examples from unseen class types. In this work, we will apply this idea to a variety of settings, largely those that lie at the boundary between language and other modalities, for which much of the existing prior work has not explicitly made use of observable structure within the data. We will also show how we can add useful inductive bias to our models through lower-level modeling choices. To retain interpretability and control, we will do this using both probabilistic variational learning frameworks and non-variational approaches such as checklist models and retrieval-guided generation.

This dissertation is organized into three parts, each applying this broad theme to specific applications. First, we'll examine the task of learning disentangled representations of style and structure in digital fonts, and then apply similar modeling ideas to the task of analyzing the handwriting styles of scribal hands of Linear B. In the next part, we'll investigate ways to learn contextualized representations for temporally ordered data where predictions for one datapoint may influence nearby predictions, such as piano fingering estimation and discursive topic modeling on social media. Finally, we'll put forward new approaches for settings where the input signal is itself multimodal, such as the task of writing descriptive captions for images and music.




Degree Type

  • Dissertation


Department

  • Language Technologies Institute

Degree Name

  • Doctor of Philosophy (PhD)


Advisor(s)

  • Taylor Berg-Kirkpatrick
