Improvements to speaker adaptive training of deep neural networks
Speaker adaptive training (SAT) is a well-studied technique for Gaussian mixture model (GMM) acoustic models. Recently, we proposed performing SAT for deep neural networks (DNNs), with speaker i-vectors incorporated into feature learning. The resulting SAT-DNN models significantly outperform standard DNNs in terms of word error rate (WER). In this paper, we present several methods to further improve and extend SAT-DNN. First, we conduct a detailed analysis of i-vector extractor training and flexible feature fusion. Second, the SAT-DNN approach is extended to improve tasks including bottleneck feature (BNF) generation, convolutional neural network (CNN) acoustic modeling, and multilingual DNN-based feature extraction. Third, for transcribing multimedia data, we enrich the i-vector representation with global speaker attributes (age, gender, etc.) obtained automatically from video signals. On a collection of instructional videos, incorporating these additional visual features is observed to boost the recognition accuracy of SAT-DNN.
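To make the core idea concrete, below is a minimal, hypothetical sketch (in PyTorch, which the paper does not specify) of one way speaker i-vectors can be applied in feature learning: a small adaptation network maps the fixed-length i-vector to a speaker-dependent shift that is added to every acoustic frame before the DNN acoustic model. The class name, layer sizes, and dimensions are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SpeakerAdaptiveFrontend(nn.Module):
    """Hypothetical sketch: map a speaker i-vector to a feature-space
    shift and apply it to every acoustic frame of that speaker.
    The paper's actual adaptation architecture may differ."""

    def __init__(self, ivector_dim: int, feat_dim: int, hidden_dim: int = 512):
        super().__init__()
        # Small adaptation network: i-vector -> per-speaker feature offset.
        self.adapt = nn.Sequential(
            nn.Linear(ivector_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, feat_dim),
        )

    def forward(self, frames: torch.Tensor, ivector: torch.Tensor) -> torch.Tensor:
        # frames: (num_frames, feat_dim); ivector: (ivector_dim,)
        shift = self.adapt(ivector)   # speaker-dependent offset
        return frames + shift         # speaker-normalized features for the DNN

# Toy usage with assumed dimensions: 40-dim filterbank frames, 100-dim i-vector.
# The visual speaker attributes described in the paper could, for example,
# be concatenated onto the i-vector before it enters the adaptation network.
frontend = SpeakerAdaptiveFrontend(ivector_dim=100, feat_dim=40)
frames = torch.randn(200, 40)         # one utterance of acoustic frames
ivec = torch.randn(100)               # one i-vector per speaker
adapted = frontend(frames, ivec)      # input to the DNN acoustic model
```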