Robust Recognition of Binaural Speech Signals Using Techniques Based on Human Auditory Processing

2019-03-08T17:28:55Z (GMT) by Anjali Menon
Automatic Speech Recognition (ASR) engines are extremely susceptible to noise. Voice-assisted devices are increasingly prevalent and need to recognize speech accurately in a variety of complex listening environments, including those with background noise, reverberation, and multiple talkers.

The human auditory system, on the other hand, is very good at understanding speech even in extremely challenging environments. It may therefore be useful to apply our knowledge of human hearing to develop techniques that lead to robust speech recognition, by incorporating processing inspired by human auditory processing into ASR systems.

In this thesis, we discuss a number of techniques that address the robust recognition of binaural signals in the presence of reverberation and multiple talkers, both of which significantly degrade ASR performance. The techniques discussed here roughly follow the manner in which the auditory system achieves noise robustness. The fundamental idea behind all of the proposed techniques is that sounds emanating from the same source exhibit some degree of coherence; we exploit this property to better isolate the target signal, leading to improved recognition accuracy.

Three techniques are proposed. The Interaural Cross-correlation-based Weighting (ICW) algorithm looks for coherence across sensors using signal envelopes in order to isolate signals coming from the same location. To reduce the effect of reverberation, steady-state suppression is applied as an initial step; the ICW algorithm combined with steady-state suppression leads to significant improvements in ASR accuracy. The Coherence-to-Diffuse Ratio-based Weighting (CDRW) algorithm uses a model-based technique to estimate the ratio of coherent energy to diffuse energy in a given signal, which also leads to significantly better ASR performance.
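The coherence-weighting idea behind ICW can be illustrated with a minimal sketch. This is not the thesis's actual implementation: the RMS-envelope approximation, frame sizes, and the sliding correlation window are all illustrative assumptions. Each frame is weighted by the local normalized cross-correlation of the left and right channel envelopes, so frames whose envelopes agree across the two sensors (likely dominated by a single coherent source) are retained, while incoherent frames are attenuated.

```python
import numpy as np

def envelope(x, frame_len=256, hop=128):
    """Short-time envelope: per-frame RMS energy, a simple stand-in for
    the auditory-style envelope extraction assumed here."""
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])
    return np.sqrt(np.mean(frames ** 2, axis=1))

def icw_weights(left, right, frame_len=256, hop=128, context=5, eps=1e-12):
    """Per-frame weights from the normalized cross-correlation of the
    left/right envelopes over a short sliding context window.
    Coherent frames score near 1; incoherent frames are pushed toward 0."""
    el = envelope(left, frame_len, hop)
    er = envelope(right, frame_len, hop)
    w = np.zeros_like(el)
    half = context // 2
    for t in range(len(el)):
        lo, hi = max(0, t - half), min(len(el), t + half + 1)
        a = el[lo:hi] - el[lo:hi].mean()   # mean-removed local envelopes
        b = er[lo:hi] - er[lo:hi].mean()
        w[t] = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
    return np.clip(w, 0.0, 1.0)           # negative correlations -> weight 0
```

In a full pipeline, these weights would be applied per frequency band after a filterbank analysis (and after steady-state suppression); the single-band version above only demonstrates the envelope-correlation weighting itself.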
The third technique is the Cross-Correlation across Frequency (CCF) algorithm, which looks for coherence across frequency channels for signal separation; it also effectively smooths the signal. The CCF algorithm has been tested in conjunction with steady-state suppression and interaural-time-difference (ITD)-based analysis, and leads to improvements in ASR accuracy, especially in the presence of moderate to high reverberation when the system is trained on clean speech. All algorithms were evaluated using DNN-based acoustic models obtained with the Kaldi speech recognition toolkit, using both clean and multi-style training data.
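The across-frequency coherence idea behind CCF can likewise be sketched. Again, this is a hypothetical illustration rather than the algorithm evaluated in the thesis: the STFT front end, window sizes, and the choice of correlating each channel with the average of its two neighbors are all assumptions. Each frequency channel is weighted by the local correlation between its temporal envelope and those of adjacent channels, so components that are coherent across frequency, as speech from a single source tends to be, are emphasized, which also produces a smoothing effect across channels.

```python
import numpy as np

def stft_mag(x, frame_len=256, hop=128):
    """Magnitude spectrogram, shape (freq, time); a simple stand-in for
    an auditory filterbank analysis."""
    n = 1 + (len(x) - frame_len) // hop
    win = np.hanning(frame_len)
    frames = np.stack([x[i * hop : i * hop + frame_len] * win for i in range(n)])
    return np.abs(np.fft.rfft(frames, axis=1)).T

def ccf_weights(mag, context=9, eps=1e-12):
    """For each interior frequency channel, correlate its temporal envelope
    with the mean of its two neighbors over a sliding window. Channels whose
    envelopes are coherent across frequency get weights near 1; edge channels
    are passed through unweighted."""
    F, T = mag.shape
    w = np.ones((F, T))
    half = context // 2
    for f in range(1, F - 1):
        for t in range(T):
            lo, hi = max(0, t - half), min(T, t + half + 1)
            a = mag[f, lo:hi] - mag[f, lo:hi].mean()
            b = 0.5 * (mag[f - 1, lo:hi] + mag[f + 1, lo:hi])
            b = b - b.mean()
            corr = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
            w[f, t] = max(0.0, corr)      # keep weights non-negative
    return w
```

Multiplying `mag` by these weights attenuates time-frequency regions whose envelopes do not track their spectral neighbors, which is one simple way to realize the across-frequency smoothing described above.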