Carnegie Mellon University

Improving the reliability of language models for summarization

thesis
posted on 2024-06-26, 18:28, authored by Kundan Krishna

Abstractive summarization models have made rapid progress since neural networks were first used for the task. The field has advanced from models that struggled to produce grammatical sentences to large language models like ChatGPT that produce fluent summaries, sometimes rated even better than human-written ones. The application of summarization models is also expanding beyond the traditionally popular domains of news articles and meeting transcripts into niche domains such as medical reports, financial articles, social media conversations, and product reviews.

Despite this progress, the reliability of summarization models is called into question by rare but catastrophic failure modes. For example, models are known to generate summaries containing statements that are factually incorrect or unsupported by the input being summarized (called hallucinations). Such errors can cause serious harm if acted upon in high-risk applications such as healthcare and finance. When deployed in the wild, models may also encounter noise in the input, which can significantly degrade summary quality. Finally, while pretraining greatly improves the quality of a model's outputs, web-sourced pretraining data can introduce undesirable behaviors, including toxic or biased outputs and verbatim generation of copyrighted content, which has led to multiple recent lawsuits. These problems can dissuade organizations from deploying summarization models in the real world.

In this thesis, we contribute methodology and resources to address these problems in summarization models. In the first part, we propose methods to generate higher-quality summaries for inputs with challenging characteristics such as long conversations or noisy documents. We introduce a modular summary generation pipeline that handles long sequences and produces better, more factual summaries. We then characterize the impact of input noise on summarization models and design lightweight probes to detect and remove the noise. In the second part, we introduce pretraining approaches that shun the use of any upstream pretraining text corpora yet still deliver a large fraction of the performance gains seen when pretraining on giant web corpora. The proposed approaches include creating a pretraining corpus artificially and reusing unlabeled text from the downstream training examples for pretraining. This part reveals that a large portion of the performance gains from pretraining is attributable to some mechanism other than knowledge transfer from large external pretraining corpora. In the third and final part, we design methods to facilitate verification of LLM-generated summaries and to detect potential factual errors in them. We create a public benchmark dataset to enable training and evaluation of models on multiple tasks useful for fact-checking summaries. We then design an interactive tool to assist users in verifying LLM-generated summaries against the reference document, and show its effectiveness at highlighting errors produced by a wide variety of LLMs for documents in diverse domains.

History

Date

2024-05-01

Degree Type

  • Dissertation

Department

  • Language Technologies Institute

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

  • Zachary C. Lipton
  • Jeffrey P. Bigham
