Carnegie Mellon University

File(s) under embargo: 1 month(s), 6 day(s) until file(s) become available

Everyday Conversation Speech Recognition with End-to-End Neural Networks

thesis
posted on 2024-07-10, 16:59, authored by Xuankai Chang

Automatic speech recognition (ASR) is an essential technology that facilitates effective human-computer interaction. With the rapid progress of deep learning techniques, end-to-end (E2E) neural network-based ASR has achieved remarkable performance. The success of ASR models has inspired applications such as virtual assistants and automatic transcription services. Despite these achievements, recognizing conversational speech remains challenging, especially in the presence of environmental noise, room reverberation, and speech overlaps.

This thesis aims to address the challenges of recognizing everyday conversational speech with E2E neural network-based ASR systems. The proposed research explores techniques and methodologies to enhance ASR performance in challenging real-world conversational scenarios. We divide the problem into several sub-problems focusing on speech overlaps, noise, and reverberation, each of which is analyzed and addressed individually. In addition, we investigate diverse E2E neural network architectures that leverage the benefits of joint training to handle these challenges. Specifically, we build E2E ASR models by integrating dedicated modules for speech enhancement, feature extraction, and speech recognition.
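To make the idea of joint training concrete, here is a minimal, hypothetical PyTorch sketch of such an integrated model: a masking-based enhancement front-end, a feature extractor standing in for a pretrained self-supervised encoder, and a CTC-based ASR branch, all updated by a single loss. The module names, sizes, and the CTC choice are illustrative assumptions, not the exact architecture from the thesis.

    import torch
    import torch.nn as nn

    class JointEnhanceASR(nn.Module):
        """Hypothetical joint model: enhancement -> features -> ASR encoder -> CTC."""
        def __init__(self, n_fft=400, feat_dim=80, hidden=256, vocab=32):
            super().__init__()
            freq = n_fft // 2 + 1
            # Enhancement front-end: predicts a time-frequency mask in [0, 1].
            self.enhance = nn.Sequential(
                nn.Linear(freq, hidden), nn.ReLU(),
                nn.Linear(hidden, freq), nn.Sigmoid(),
            )
            # Feature extractor (a stand-in for a pretrained self-supervised encoder).
            self.feats = nn.Linear(freq, feat_dim)
            # ASR encoder and CTC output head (vocab includes the blank symbol).
            self.encoder = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
            self.ctc_head = nn.Linear(hidden, vocab)

        def forward(self, spec):                  # spec: (batch, time, freq) magnitudes
            enhanced = spec * self.enhance(spec)  # masking-based enhancement
            x = self.feats(enhanced)
            x, _ = self.encoder(x)
            return self.ctc_head(x).log_softmax(dim=-1)

    model = JointEnhanceASR()
    spec = torch.rand(2, 100, 201)                # fake noisy magnitude spectrograms
    log_probs = model(spec)                       # (batch=2, time=100, vocab=32)

    # Joint training: one CTC loss back-propagates through every module at once,
    # so the enhancement front-end is optimized for recognition accuracy rather
    # than for signal-level criteria alone.
    targets = torch.randint(1, 32, (2, 20))
    loss = nn.CTCLoss(blank=0)(
        log_probs.transpose(0, 1),                # CTC expects (time, batch, vocab)
        targets,
        torch.full((2,), 100),                    # input lengths
        torch.full((2,), 20),                     # target lengths
    )
    loss.backward()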

We begin with the fundamental task of speech recognition on single-channel input containing a single speaker. Environmental noise and room reverberation significantly degrade recognition performance in such conditions. To address this challenge, we propose a novel model architecture that integrates speech enhancement, self-supervised learning, and ASR models into a single neural network with an efficient training strategy. This integration leads to notable performance improvements, demonstrating the feasibility and effectiveness of E2E neural networks for speech with complex acoustic and linguistic properties.

We then extend our approach to multi-channel speech input with a single speaker. Inspired by recent advances in large speech foundation models, we expand the capabilities of a model trained on thousands of hours of single-channel speech data to handle multi-channel input. This extension significantly improves performance, particularly on real meeting transcription data.

Furthermore, we address the challenge of speech overlaps, an area that remains underexplored. Overlapping speech makes it difficult to decode and align individual utterances accurately. To tackle this, we propose several E2E models designed specifically to recognize overlapping speech in single-channel input.

Finally, we turn to multi-channel speech input with speech overlaps present in the signal. We introduce a model that processes multi-channel input from multiple speakers, leveraging spatial information for improved performance. This model also integrates several of the approaches proposed earlier, further enhancing its effectiveness in challenging scenarios.
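For the overlapped-speech setting described above, one widely used training scheme in multi-speaker E2E ASR is permutation-invariant training (PIT), which scores every assignment of output branches to reference speakers and optimizes the best one. The sketch below is a minimal illustration of that idea with toy tensors and an MSE stand-in for the per-utterance ASR loss; it is an assumption for exposition, not the exact objective used in the thesis.

    from itertools import permutations
    import torch
    import torch.nn.functional as F

    def pit_loss(outputs, references, loss_fn):
        """Return the loss under the best branch-to-speaker assignment.

        outputs, references: lists of per-branch / per-speaker tensors.
        """
        best = None
        for perm in permutations(range(len(references))):
            # Pair branch i with reference perm[i] and sum the pairwise losses.
            total = sum(loss_fn(outputs[i], references[p])
                        for i, p in enumerate(perm))
            if best is None or total < best:
                best = total
        return best

    # Toy two-speaker example (hypothetical shapes): two output branches,
    # two reference targets, MSE standing in for a real ASR loss such as CTC.
    outs = [torch.randn(10, 5, requires_grad=True) for _ in range(2)]
    refs = [torch.randn(10, 5) for _ in range(2)]
    pit_loss(outs, refs, F.mse_loss).backward()

Because the minimum is taken over all assignments, gradients flow only through the best permutation, which lets the output branches specialize on different speakers without imposing a fixed speaker order.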

History

Date

2024-06-01

Degree Type

  • Dissertation

Department

  • Language Technologies Institute

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Shinji Watanabe
