Evaluation Metrics For Language Models
The most widely-used evaluation metric for language models for speech recognition is the perplexity of test data. While perplexities can be calculated efficiently and without access to a speech recognizer, they often do not correlate well with speech recognition word-error rates. In this research, we attempt to find a measure that like perplexity is easily calculated but which better predicts speech recognition performance. We investigate two approaches; first, we attempt to extend perplexity by using similar measures that utilize information about language models that perplexity ignores. Second, we attempt to imitate the word-error calculation without using a speech recognizer by artificially generating speech recognition lattices. To test our new metrics, we have built over thirty varied language models. We find that perplexity correlates with word-error rate remarkably well when only considering n-gram models trained on in-domain data. When considering other types of models, our novel metrics are superior to perplexity for predicting speech recognition performance. However, we conclude that none of these measures predict word-error rate sufficiently accurately to be effective tools for language model evaluation in speech recognition.