The CMU Statistical Language Modeling Toolkit and its use in the 1994 ARPA CSR Evaluation
The Carnegie Mellon Statistical Language Modeling (CMU SLM) Toolkit is a set of Unix software tools designed to facilitate language modeling work in the research community. The package, including source code, is freely available for research purposes. As of December 1994, the toolkit is in active use by 23 research groups in 8 countries. It was recently used to process the 2.5 GB NAB corpus for the ARPA CSR community. In this paper, I first discuss the design principles and features of the toolkit. I then describe the composition of the NAB corpus and report on the n-gram statistics, standard vocabulary, and language models created using the SLM tools.