Carnegie Mellon University
Menon_cmu_0041E_10364.pdf (9.07 MB)

Robust Recognition of Binaural Speech Signals Using Techniques Based on Human Auditory Processing

thesis
posted on 2019-03-08, 17:28 authored by Anjali Menon
Automatic speech recognition (ASR) engines are extremely susceptible to noise. Voice-assisted devices are increasingly prevalent, and they need to recognize speech accurately in a variety of complex listening environments, including those with background noise, reverberation, and multiple talkers.
The human auditory system, on the other hand, is very good at understanding speech even in extremely challenging environments. It might therefore be useful to use our knowledge of human hearing to develop techniques that lead to robust speech recognition. This entails applying techniques that have their basis in human auditory processing to ASR.
In this thesis, we discuss a number of techniques that address the problem of robust recognition of binaural signals in the presence of reverberation and multiple talkers, since these conditions significantly degrade ASR engine performance. The techniques discussed here roughly follow the manner in which the auditory system achieves noise robustness. The fundamental idea behind all the proposed techniques is that sounds emanating from the same sound source exhibit some degree of coherence. We aim to use this property to better isolate the target signal, leading to better speech recognition accuracy. Three techniques are proposed. The Interaural Cross-correlation-based Weighting (ICW) algorithm looks for coherence across sensors using signal envelopes in order to isolate signals coming from the same location. To reduce the effect of reverberation, steady-state suppression is applied as an initial step. The ICW algorithm combined with steady-state suppression leads to significant improvements in ASR accuracy. The Coherence-to-Diffuse Ratio-based Weighting (CDRW) algorithm uses a model-based technique to evaluate the ratio of coherent energy to diffuse energy in a given signal. This leads to significantly better ASR performance. The third technique is the Cross-Correlation across Frequency (CCF) algorithm, which looks for coherence across frequency for signal separation. The CCF algorithm also effectively smooths the signal. This algorithm has been tested in conjunction with steady-state suppression and interaural time difference (ITD)-based analysis. The CCF algorithm leads to improvements in ASR, especially in the presence of moderate to high reverberation when the system is trained on clean speech. All algorithms were tested using DNN-based acoustical models obtained with the Kaldi speech recognition toolkit, using both clean and multistyle training data.
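
To make the coherence idea concrete, here is a minimal Python sketch of envelope-based cross-correlation between two sensors. It is not the thesis's ICW algorithm: the moving-average envelope, the frame and hop sizes, the zero-lag correlation, and the function names `envelope` and `frame_coherence_weights` are all assumptions chosen for illustration. It only shows how a per-frame weight near 1 could mark frames where the two channels carry the same source, and a lower weight could mark frames where they do not.

```python
import numpy as np

def envelope(x, win=64):
    """Crude envelope: moving average of the rectified signal (assumed choice)."""
    kernel = np.ones(win) / win
    return np.convolve(np.abs(x), kernel, mode="same")

def frame_coherence_weights(left, right, frame_len=400, hop=160, eps=1e-12):
    """Per-frame zero-lag normalized cross-correlation of the two envelopes.

    Hypothetical helper for illustration; negative correlations are
    clipped to zero so the result can be used as a weight in [0, 1].
    """
    env_l, env_r = envelope(left), envelope(right)
    weights = []
    for start in range(0, len(left) - frame_len + 1, hop):
        a = env_l[start:start + frame_len]
        b = env_r[start:start + frame_len]
        a = a - a.mean()  # remove the frame mean before correlating
        b = b - b.mean()
        rho = np.dot(a, b) / (np.sqrt(np.dot(a, a) * np.dot(b, b)) + eps)
        weights.append(max(rho, 0.0))
    return np.array(weights)

# Usage: identical channels versus two independent noise channels.
rng = np.random.default_rng(0)
source = rng.standard_normal(16000)
coherent = frame_coherence_weights(source, source)
diffuse = frame_coherence_weights(source, rng.standard_normal(16000))
```

In this toy setup, the weights for identical channels sit at essentially 1, while those for independent noise are lower on average. A real system would typically work per frequency band and apply the weights to a time-frequency representation before feature extraction.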

History

Date

2019-02-23

Degree Type

  • Dissertation

Department

  • Electrical and Computer Engineering

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Richard Stern
