Semi-Supervised Learning for Speech Recognition in the Context of Accent Adaptation
Accented speech that is under-represented in the training data still suffers from high Word Error Rates (WER) with state-of-the-art Automatic Speech Recognition (ASR) systems. Careful collection and transcription of training data for different accents can address this issue, but it is both time-consuming and expensive. However, for many tasks such as broadcast news or voice search, it is easy to obtain large amounts of audio data from target users with representative accents, albeit without accent labels or even transcriptions. Semi-supervised training has been explored for ASR in the past to leverage such data, but many of these techniques assume homogeneous training and test conditions. In this paper, we experiment with cross-entropy based speaker selection to adapt a source recognizer to a target accent in a semi-supervised manner, using additional data with no accent labels. We compare our technique to self-training based only on confidence scores and show that we obtain significant improvements over the baseline by leveraging additional unlabeled data on two different tasks in Arabic and English.
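To make the selection step concrete, the sketch below shows one plausible instantiation of cross-entropy based speaker selection: a Moore-Lewis style cross-entropy difference computed over each unlabeled speaker's first-pass hypotheses, keeping the speakers that score closest to the target accent for retraining. The names `hyps_by_speaker`, `target_model`, `source_model`, and the `log_prob` interface are hypothetical; the exact scoring used in the paper may differ.

```python
from typing import Dict, List, Tuple


def speaker_cross_entropy(utterances: List[str], model) -> float:
    """Average per-token cross-entropy of a speaker's first-pass
    hypotheses under a model exposing a hypothetical
    log_prob(text) -> (log_probability, n_tokens) method."""
    total_logprob, total_tokens = 0.0, 0
    for text in utterances:
        logprob, n_tokens = model.log_prob(text)  # hypothetical API
        total_logprob += logprob
        total_tokens += n_tokens
    return -total_logprob / max(total_tokens, 1)


def select_speakers(
    hyps_by_speaker: Dict[str, List[str]],
    target_model,
    source_model,
    top_k: int,
) -> List[str]:
    """Rank unlabeled speakers by the cross-entropy difference
    H_target(s) - H_source(s); a lower difference suggests the
    speaker's data is closer to the target accent."""
    scored: List[Tuple[float, str]] = []
    for speaker, utterances in hyps_by_speaker.items():
        h_target = speaker_cross_entropy(utterances, target_model)
        h_source = speaker_cross_entropy(utterances, source_model)
        scored.append((h_target - h_source, speaker))
    scored.sort()  # most target-like speakers first
    return [speaker for _, speaker in scored[:top_k]]
```

The selected speakers' automatically transcribed data would then be pooled with the supervised source data to retrain or adapt the recognizer, in contrast to plain confidence-based self-training, which filters utterances only by recognizer confidence.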