Incorporating Context Information into Deep Neural Network Acoustic Models
The introduction of deep neural networks (DNNs) has advanced the performance of automatic speech recognition (ASR) tremendously. On a wide range of ASR tasks, DNN models show superior performance to the traditional Gaussian mixture models (GMMs). Despite these advances, DNN models still suffer from data scarcity, speaker mismatch and environment variability. This thesis addresses these challenges by fully exploiting DNNs’ ability to integrate heterogeneous features under the same optimization objective. We propose to improve DNN models under these challenging conditions by incorporating context information into DNN training.
On a new language, the amount of training data may be highly limited. This data scarcity degrades the recognition accuracy of DNN models. A solution is to transfer knowledge from other languages to the low-resource condition. This thesis proposes a framework for building cross-language DNNs via language-universal feature extractors (LUFEs). Convolutional neural networks (CNNs) and deep maxout networks (DMNs) are employed to improve the quality of LUFEs by generating invariant and sparse feature representations. This framework notably improves recognition accuracy on a wide range of low-resource languages.
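As a rough illustration of the idea, the following sketch assumes PyTorch and shows a deep maxout feature extractor whose hidden layers are shared across training languages, with one language-specific output layer per language. The class names, layer counts and dimensions are illustrative assumptions, not the thesis implementation.

```python
# A minimal sketch of a language-universal feature extractor (LUFE):
# hidden maxout layers shared across languages, plus per-language heads.
import torch
import torch.nn as nn

class MaxoutLayer(nn.Module):
    """Linear layer followed by max-pooling over groups of hidden units."""
    def __init__(self, in_dim, out_dim, pool_size=3):
        super().__init__()
        self.out_dim, self.pool_size = out_dim, pool_size
        self.linear = nn.Linear(in_dim, out_dim * pool_size)

    def forward(self, x):
        z = self.linear(x)                            # (batch, out_dim * pool_size)
        z = z.view(-1, self.out_dim, self.pool_size)  # group units into pools
        return z.max(dim=2).values                    # keep the max of each pool

class LanguageUniversalExtractor(nn.Module):
    """Shared maxout stack with one softmax (output) layer per language."""
    def __init__(self, feat_dim, hidden_dim, num_targets_per_lang):
        super().__init__()
        self.shared = nn.Sequential(
            MaxoutLayer(feat_dim, hidden_dim),
            MaxoutLayer(hidden_dim, hidden_dim),
            MaxoutLayer(hidden_dim, hidden_dim),
        )
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, n) for n in num_targets_per_lang]
        )

    def forward(self, feats, lang_id):
        bottleneck = self.shared(feats)         # language-universal features
        return self.heads[lang_id](bottleneck)  # language-specific logits
```

After multilingual training, the shared stack can be kept as the feature extractor for a new low-resource language while a fresh output layer is trained on the limited target-language data.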
The performance of DNNs degrades when there is a mismatch between the acoustic model and the test speakers. I-vectors are a form of context information that encapsulates speaker characteristics. This thesis proposes a novel framework to perform feature-space speaker adaptive training (SAT) for DNN models. A key component of this approach is an adaptation network which takes i-vectors as inputs and projects DNN inputs into a speaker-normalized feature space. The DNN model fine-tuned in this new feature space is less affected by speaker variability and becomes more independent of specific speakers. This SAT method is applicable to different feature types and model architectures.
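A minimal sketch of this idea, again assuming PyTorch, is given below. The AdaptationNetwork and SATAcousticModel names, the additive form of the feature-space correction, and all dimensions are illustrative assumptions rather than the thesis implementation.

```python
# Sketch of i-vector based feature-space speaker adaptive training (SAT):
# an adaptation network maps a speaker i-vector to a correction that is
# applied to the acoustic features before the main DNN.
import torch
import torch.nn as nn

class AdaptationNetwork(nn.Module):
    """Maps a speaker i-vector to a shift in the input feature space."""
    def __init__(self, ivector_dim=100, feat_dim=440, hidden_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(ivector_dim, hidden_dim), nn.Sigmoid(),
            nn.Linear(hidden_dim, feat_dim),
        )

    def forward(self, ivector):
        return self.net(ivector)  # (batch, feat_dim) speaker-dependent shift

class SATAcousticModel(nn.Module):
    def __init__(self, feat_dim=440, hidden_dim=1024, num_senones=4000):
        super().__init__()
        self.adapt = AdaptationNetwork(feat_dim=feat_dim)
        self.dnn = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.Sigmoid(),
            nn.Linear(hidden_dim, hidden_dim), nn.Sigmoid(),
            nn.Linear(hidden_dim, num_senones),
        )

    def forward(self, feats, speaker_ivector):
        # Project inputs into a speaker-normalized feature space,
        # then run the DNN in that space.
        normalized = feats + self.adapt(speaker_ivector)
        return self.dnn(normalized)
```

Because the adaptation happens in the feature space, the main DNN sees normalized inputs regardless of its internal architecture, which is why the method generalizes across feature types and model structures.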
The proposed adaptive training framework is further extended to incorporate distance- and video-related context information. The distance descriptors are extracted from deep learning models trained to distinguish distance types at the frame level. Distance adaptive training (DAT) with these descriptors captures the speaker-microphone distance dynamically, frame by frame. When performing ASR on video data, we naturally have access to both the speech and the video modality. Video- and segment-level visual features are extracted from the video stream. Video adaptive training (VAT) with these visual features results in acoustic models that are more robust to environment variability. Moreover, the proposed VAT approach removes the need for frame-level visual features and thus enables audio-visual ASR on truly open-domain videos.
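The adaptation mechanism itself stays the same in these extensions; only the context descriptor fed to the adaptation network changes. The short sketch below makes this concrete, again assuming PyTorch, with a hypothetical ContextAdaptiveModel class and illustrative descriptor dimensions that are not taken from the thesis.

```python
# Sketch of the generalized adaptive-training idea: swap the context
# descriptor (i-vectors, distance descriptors, or visual features) while
# keeping the adaptation network and the main DNN unchanged.
import torch
import torch.nn as nn

class ContextAdaptiveModel(nn.Module):
    def __init__(self, feat_dim, context_dim, hidden_dim=1024, num_senones=4000):
        super().__init__()
        # Adaptation network: context descriptor -> feature-space correction.
        self.adapt = nn.Sequential(
            nn.Linear(context_dim, 512), nn.Sigmoid(),
            nn.Linear(512, feat_dim),
        )
        self.dnn = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.Sigmoid(),
            nn.Linear(hidden_dim, num_senones),
        )

    def forward(self, feats, context):
        return self.dnn(feats + self.adapt(context))

# DAT: context = frame-level distance descriptors from a distance classifier.
dat_model = ContextAdaptiveModel(feat_dim=440, context_dim=64)
# VAT: context = video- or segment-level visual features, so no frame-level
# visual features are required for open-domain audio-visual ASR.
vat_model = ContextAdaptiveModel(feat_dim=440, context_dim=128)
```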
History
Date
- 2016-08-08
Degree Type
- Dissertation
Department
- Language Technologies Institute
Degree Name
- Doctor of Philosophy (PhD)