Carnegie Mellon University

End-to-End Modeling for Abstractive Speech Summarization

thesis
posted on 2024-04-10, 21:01 authored by Roshan Sharma

 In our increasingly interconnected world, where speech remains the most intuitive and natural form of communication, spoken language processing systems face a crucial challenge: they must do more than merely categorize speech; they need to truly understand it in order to generate meaningful responses. One key aspect of this understanding is speech summarization, where a system condenses the important information from spoken input into a concise summary. This thesis addresses the challenge of generating abstractive textual summaries directly from speech.

The classical approach relies on cascade systems that realize speech summarization by first transcribing speech and then summarizing the resulting transcript. However, this approach comes with many challenges, including computational inefficiency, domain mismatch, and error propagation. In this thesis, we propose an alternative: an end-to-end framework that directly optimizes a single sequence model for speech summarization. To implement such end-to-end models with constrained computing resources, we address challenges such as abstract learning, learning global acoustic context, dealing with the paucity of data, and improving the quality of summaries using multiple references. We also shed light on observations from human annotation for speech summarization. We present multi-stage training, using speech transcription as a pre-training task, to address abstract learning and improve the performance of end-to-end models. We describe multiple solutions to the problem of global acoustic context: restricted self-attention, replacing self-attention with the Fourier transform, and two block-wise adaptation solutions, BASS and R-BASS, that reframe speech summarization through the lens of block-wise processing. To address the challenge of data paucity, we introduce two new datasets, SLUE-TED and Interview, for abstractive speech summarization. An exploration of human annotation provides insights into best practices and the nature of the differences between speech-based and transcript-based summaries. Finally, we propose a novel method called AugSumm that improves the diversity and fluency of speech summaries by leveraging auxiliary references from generative text models.
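
Two of the global-context ideas mentioned above lend themselves to a compact illustration. The sketch below is only a minimal example: it assumes a PyTorch encoder, and the names restricted_attention_mask and FourierMixing are illustrative rather than the thesis's actual implementation. It shows (1) a band mask that restricts each frame's self-attention to a fixed local window, and (2) an attention-free mixing layer that replaces self-attention with a two-dimensional Fourier transform in the style of FNet.

```python
import torch
import torch.nn.functional as F


def restricted_attention_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean band mask: frame t may only attend to frames within +/- `window`."""
    idx = torch.arange(seq_len)
    # (seq_len, seq_len); True means the pair of frames may attend to each other.
    return (idx[None, :] - idx[:, None]).abs() <= window


class FourierMixing(torch.nn.Module):
    """Attention-free token mixing: apply an FFT over the feature axis, then the
    time axis, and keep the real part of the result (FNet-style)."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feature)
        return torch.fft.fft(torch.fft.fft(x, dim=-1), dim=-2).real


if __name__ == "__main__":
    batch, time, dim, window = 2, 1000, 256, 64
    x = torch.randn(batch, time, dim)

    # Restricted self-attention: each frame attends only within a local window,
    # so the effective cost scales with the window rather than the full utterance.
    mask = restricted_attention_mask(time, window)
    q = k = v = x.unsqueeze(1)  # add a head dimension: (batch, heads=1, time, dim)
    local_out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

    # Fourier mixing: no attention weights at all; mixing over time is O(T log T).
    fft_out = FourierMixing()(x)
    print(local_out.shape, fft_out.shape)
```

Both pieces target the same bottleneck: full self-attention over long acoustic sequences is quadratic in the number of frames, which is impractical for whole recordings under constrained compute.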

History

Date

2024-03-17

Degree Type

  • Dissertation

Department

  • Electrical and Computer Engineering

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Bhiksha Raj
