Towards Efficient Neural Machine Translation
Machine translation (MT), the use of machines to automatically translate text from one language to another, aims to overcome language barriers among people from different cultures. Recently, neural network-based machine translation (NMT) models have significantly narrowed the gap between machine and human translations in terms of translation accuracy. However, they also introduce new challenges [90], and efficiency is one of the most important. Specifically, with a complex deep network structure, NMT models generally have high space and computational costs, hindering their deployment in real-time applications with strict latency requirements or on devices with limited memory resources. In this thesis, we aim to improve the decoding efficiency of NMT from three aspects: (1) computational efficiency: NMT, like other deep learning models, employs a deep network structure with a large number of parameters and high model complexity, resulting in high memory usage and a relatively slow decoding process; (2) decoding parallelizability: another main reason for the slow inference of NMT is the autoregressive property of its decoder, which generates only one token at a time and therefore cannot be parallelized; (3) efficiency in multilingual NMT: to better support translation between multiple languages, a popular strategy is to employ a deeper encoder and decoder with increased model capacity. However, the extra latency and memory costs introduced by this approach make it unacceptable for efficiency-constrained applications.
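To make the sequential bottleneck in point (2) concrete, below is a minimal sketch of greedy autoregressive decoding. It is not the thesis's actual decoder: `decoder_step`, `BOS`, and `EOS` are hypothetical placeholders standing in for one forward pass of an NMT decoder and its special tokens; the point is the data dependency between steps, not the model itself.

```python
# Minimal sketch of the autoregressive decoding bottleneck described above.
# `decoder_step` is a hypothetical stand-in for one decoder forward pass.

BOS, EOS = 0, 1  # hypothetical special-token ids

def decoder_step(src_encoding, prefix):
    """Return the id of the next target token given the prefix generated so far."""
    # Placeholder logic: a real model would attend over `src_encoding` and
    # `prefix`; here we simply emit a few tokens and then stop.
    return EOS if len(prefix) > 5 else len(prefix) + 2

def autoregressive_decode(src_encoding, max_len=64):
    prefix = [BOS]
    for _ in range(max_len):
        # Each step consumes the tokens produced by all previous steps,
        # so the loop iterations cannot be executed in parallel.
        next_token = decoder_step(src_encoding, prefix)
        prefix.append(next_token)
        if next_token == EOS:
            break
    return prefix[1:]

print(autoregressive_decode(src_encoding=None))
```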
This thesis consists of three parts that tackle these challenges respectively. First, to improve computational efficiency, we focus on several modules of NMT and develop novel structures and learning algorithms, including (1) investigating word encoding mechanisms that significantly reduce the time and space consumption of the embedding and softmax layers; and (2) developing a linear unified nested attention mechanism that approximates regular attention, yielding only linear (as opposed to quadratic) time and space complexity. Second, we relax the autoregressive constraint in the conventional NMT decoding algorithm to speed up decoding, including (1) designing a semi-autoregressive decoding algorithm that keeps the autoregressive property locally but avoids it globally; and (2) developing a fully non-autoregressive translation system in which all tokens are generated in parallel. Finally, we investigate decoding efficiency in the multilingual translation scenario, which consists of (1) studying the speed-accuracy trade-off for multilingual translation; and (2) improving decoding speed through model capacity allocation at different granularities while maintaining superior translation quality.
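As a contrast to the sequential loop sketched earlier, the following sketch illustrates the fully non-autoregressive idea in which all target tokens come from a single pass. Again, `predict_all_positions` and the fixed `predicted_length` are hypothetical placeholders, not the thesis's actual model; a real system would also predict the target length and typically apply refinement or reranking.

```python
# Minimal sketch of fully non-autoregressive decoding: every target position
# is scored in one forward pass, so there is no per-token loop to serialize.

def predict_all_positions(src_encoding, target_length):
    """Return one token id per target position from a single forward pass."""
    # Placeholder logic: a real non-autoregressive decoder conditions each
    # position only on the source (and a predicted length), so the positions
    # carry no left-to-right dependency and can be computed in parallel.
    return [pos + 2 for pos in range(target_length)]

def non_autoregressive_decode(src_encoding, predicted_length=6):
    # All tokens are produced by one call rather than one call per token.
    return predict_all_positions(src_encoding, predicted_length)

print(non_autoregressive_decode(src_encoding=None))
```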
History
Date
- 2022-05-31
Degree Type
- Dissertation
Department
- Language Technologies Institute
Degree Name
- Doctor of Philosophy (PhD)