Carnegie Mellon University

Designing Transparent and Factual Text Generation Systems Grounded in Language Structure

thesis
posted on 2024-05-03, 15:38 authored by Vidhisha Balachandran

  Large language models have brought about a shift towards constructing large, general purpose computational models of language, moving away from task-specific architectures. These models, trained on massive unstructured data, are opaque and challenging to control by design. Consequently, such data-driven models tend to overfit to spurious artifacts, perform poorly on underrepresented data, and fail in unpredictable ways. Thus, a paradigm shift towards developing trustworthy systems to ensure fairness, accountability, and robustness in their outcomes is essential. In this thesis, I argue that leveraging language structures to design trustworthy systems can facilitate this shift.

  This thesis presents methods and solutions that leverage language structure to improve the trustworthiness, transparency, and reliability of large-scale, data-driven language generation models, across various stages of the model pipeline. The thesis is divided into three parts. The first part introduces semantically grounded evaluation measures and analysis to assess the factual reliability of trained language generation models. The second part presents model designs that incorporate inter-sentence structures to promote inductive biases and transparency. Finally, the third part presents techniques that use syntactic structures to generate synthetic, general, high-quality datasets for training robust and factual systems. The thesis highlights the challenges in developing trustworthy language generation models and proposes solutions that utilize language structure to improve their interpretability and factual reliability by design.

History

Date

2024-04-01

Degree Type

  • Dissertation

Department

  • Language Technologies Institute

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Yulia Tsvetkov
