Carnegie Mellon University

Low-Resource Speech Recognition for Thousands of Languages

posted on 2023-08-25, 20:06 authored by Xinjian Li

Speech recognition performance has improved rapidly in recent years thanks to modern architectures. These models, however, typically require thousands of hours of training data for the target language. There are around 8000 languages in the world, and the majority of them have no audio or text dataset at all, which significantly restricts the set of languages such models can target.

This thesis attempts to extend speech recognition to thousands of languages by reducing the dataset requirement. In particular, we present a speech recognition pipeline that does not require any audio for the target language. The only assumption is that we have access to raw text datasets or a set of n-gram statistics for the target language. Under the most minimal assumption, we use only a lexicon for the target language. Our speech pipeline consists of three components: an acoustic model, a pronunciation model, and a language model. Unlike the standard pipeline, our acoustic and pronunciation models are multilingual and use no supervision from the target language.

The first part of this thesis discusses the hierarchical acoustic model, which can be decomposed into two submodules: the universal phone recognition model recognizes language-independent phones using phonological articulatory features, and the allophone model then maps those phones into language-dependent phonemes. In the second part, we turn our focus to the pronunciation model and the language model. We develop a zero-shot grapheme-to-phoneme (G2P) model that approximates the target language using its nearest languages in the phylogenetic tree. This G2P model serves as the pronunciation model. The language model can be built from n-gram statistics or from a raw text dataset. We build our language model by combining a large n-gram database of endangered languages with a lexicon database. In the last part, we introduce the two databases we use in the pipeline and relevant alignment applications. Using the proposed pipeline and datasets, we build speech recognition systems for 6185 languages, which significantly expands the scope of target languages in speech recognition.
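The idea of picking the nearest languages in a phylogenetic tree can be illustrated with a minimal sketch. It is not the thesis implementation: the abstract does not say how nearness is measured, so the sketch assumes it is the number of edges between two languages through their lowest common ancestor. The tree, the language names, and the function names are all hypothetical placeholders.

```python
# Toy phylogenetic tree stored as child -> parent.
# Every name below is an illustrative placeholder, not real data.
PARENT = {
    "target": "branch_1",
    "lang_a": "branch_1",
    "lang_b": "branch_1",
    "lang_c": "branch_2",
    "branch_1": "family",
    "branch_2": "family",
}


def ancestors(node):
    """Return the path from node up to the root, starting with node itself."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def tree_distance(a, b):
    """Count the edges between a and b through their lowest common ancestor."""
    depth_in_b = {name: depth for depth, name in enumerate(ancestors(b))}
    for depth_in_a, name in enumerate(ancestors(a)):
        if name in depth_in_b:
            return depth_in_a + depth_in_b[name]
    raise ValueError(f"{a!r} and {b!r} share no ancestor")


def nearest_languages(target, candidates, k):
    """Return the k candidates closest to target; ties are broken by name."""
    ranked = sorted(candidates, key=lambda c: (tree_distance(target, c), c))
    return ranked[:k]


if __name__ == "__main__":
    print(nearest_languages("target", ["lang_c", "lang_a", "lang_b"], k=2))
```

In a zero-shot setting of the kind the abstract describes, the languages selected this way would be ones that already have G2P resources, and their models would stand in for the target language that has none.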




Degree Type

  • Dissertation


  • Language Technologies Institute

Degree Name

  • Doctor of Philosophy (PhD)


Shinji Watanabe, Alan W Black
