Improving a Latent Semantic Analysis-Based Language Model by Integrating Multiple Levels of Knowledge
We describe an extension to the use of Latent Semantic Analysis (LSA) for language modeling. This technique makes it easier to exploit long-distance relationships in natural language, for which traditional n-gram models are ill-suited.
However, as the history grows longer, its semantic representation may be contaminated by irrelevant information, increasing the uncertainty in predicting the next word. To address this problem, we propose a multilevel framework that divides the history into three levels corresponding to the document, paragraph, and sentence. A softmax network is used to combine the three levels of information with the n-gram. We further present a statistical scheme that dynamically determines the unit scope during the generalization stage. The combination of these techniques yields a 14% perplexity reduction over the trigram model on a subset of the Wall Street Journal corpus.
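To illustrate the combination scheme in rough outline, the Python sketch below interpolates a trigram probability with three LSA components (document, paragraph, and sentence scopes) using history-dependent mixture weights produced by a small softmax network. All names here (`lsa_prob`, the gating weights `W`, the history feature vector `phi`) are hypothetical illustrations under assumed definitions, not the paper's implementation; in particular, the cosine-to-probability mapping is a placeholder for whatever normalization the actual model uses.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a score vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def lsa_prob(word_vec, history_vec):
    """Toy LSA component: cosine similarity mapped to a pseudo-probability.

    Hypothetical normalization; the paper's actual mapping may differ.
    """
    cos = word_vec @ history_vec / (
        np.linalg.norm(word_vec) * np.linalg.norm(history_vec) + 1e-12)
    return max(cos, 0.0) + 1e-12  # clip negatives, floor to stay positive

def combined_prob(p_trigram, word_vec, doc_vec, para_vec, sent_vec, W, phi):
    """Softmax-gated mixture of the trigram and three LSA history scopes.

    p_trigram          : trigram probability of the candidate word
    word_vec           : LSA-space vector of the candidate word
    doc/para/sent_vec  : LSA-space vectors of the three history scopes
    W                  : (4, d) weight matrix of the gating network (learned)
    phi                : (d,) feature vector describing the current history
    """
    components = np.array([
        p_trigram,
        lsa_prob(word_vec, doc_vec),
        lsa_prob(word_vec, para_vec),
        lsa_prob(word_vec, sent_vec),
    ])
    lambdas = softmax(W @ phi)  # history-dependent mixture weights
    return float(lambdas @ components)
```

Because the gating weights depend on the history features, the model can shift probability mass toward the sentence-level component when the longer-range context is noisy, which is the intuition behind dividing the history into three scopes.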