Carnegie Mellon University

Speech Enhancement Using Deep Neural Networks

thesis
posted on 2024-05-24, 17:17 authored by Yangyang Xia

Speech enhancement algorithms aim to improve the quality and intelligibility of speech degraded by noise, for the benefit of both human listeners and machine recognition. Thanks to large-scale datasets and online simulation, supervised algorithms based on deep neural networks can accurately suppress non-stationary noise, which makes them useful in practice for real-time communication systems and as the front end of automatic speech recognition systems. Despite these advances, the extent to which these algorithms are robust to adverse acoustic conditions and to different phonetic categories of speech stimuli is still being investigated.

This thesis addresses supervised speech enhancement in three parts. First, we describe the four-region error, which serves as a diagnostic tool for speech enhancement algorithms. Unlike popular perceptual measures of speech quality, the four-region error distinguishes between two universal problems: under-suppression and over-suppression. We will show that all algorithms exhibit a trade-off between these error types and describe loss functions that balance the two. Second, we address the under-suppression problem within the frequency-domain speech enhancement framework. In the domain of instantaneous signal-to-noise ratio (ISNR), we unify algorithms trained on different targets. We will show that all methods face inevitable uncertainties as the ISNR decreases. We then introduce uncertainty learning, which quantifies these uncertainties and improves noise reduction capability. Third, we address the over-suppression problem by incorporating phonetic information into the supervised framework. Through measurements of the phonetically dependent four-region error, we identify the over-suppression of obstruents in American English as the critical challenge for frequency-domain algorithms. We further identify a class of time-domain algorithms that exhibit different trade-offs and use them to train a phonetic segregation network. Finally, we explore phonetically dependent channel selection rules to improve automatic speech recognition accuracy.

History

Date

2024-05-04

Degree Type

  • Dissertation

Department

  • Electrical and Computer Engineering

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Richard M. Stern
