Massively Multilingual Text Translation for Low-Resource Languages
Translation into severely low-resource languages serves both a cultural goal, preserving and revitalizing those languages, and a humanitarian goal, meeting the everyday needs of local communities, needs that the recent COVID-19 pandemic has made more urgent. Many humanitarian efforts do not require a universal translation engine, but rather a translation engine dedicated to a specific text. Healthcare records, hygiene procedures, government communications, emergency procedures, and religious texts, for example, are all limited texts. While generic translation engines covering all languages do not exist, translating limited texts that are already known in many languages into new, low-resource languages may be feasible and may reduce human translation effort. We attempt to leverage translation resources from rich-resource languages to efficiently produce the best possible translation quality, in a new, low-resource language, for well-known texts that are available in multiple languages.
To achieve this efficiency, we translate a closed text that is known in advance and available in multiple source languages into a new, low-resource language. Despite the challenges of little data and few human experts, we develop methods that promote cross-lingual transfer, leverage paraphrase diversity, address the variable-binding problem, measure language similarity, build efficient active learning algorithms for selecting seed sentences, and activate knowledge in large pretrained models, producing quality translation with as few as a few hundred lines of low-resource data. Working with extremely small data, we demonstrate that it is possible to produce useful translations that let machines work alongside human translators to expedite the translation process, which is exactly the goal of this thesis.
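As one concrete illustration of the active-learning component, the sketch below greedily selects seed sentences by word coverage over the closed text. This is a hedged example: the greedy coverage criterion, function names, and toy corpus are illustrative assumptions, not the exact algorithm developed in this thesis.

```python
"""A minimal sketch of seed-sentence selection for a closed text.

Assumption: we use a greedy word-coverage criterion, a common baseline
for choosing which sentences to have humans translate first. The
thesis's actual selection strategies may differ.
"""

def select_seed_sentences(corpus, budget):
    """Greedily choose `budget` sentences that maximize new-word coverage."""
    covered = set()
    selected = []
    remaining = list(enumerate(corpus))
    for _ in range(budget):
        if not remaining:
            break
        # Score each candidate by how many not-yet-covered words it adds.
        best_idx, best_gain = 0, -1
        for i, (_, sent) in enumerate(remaining):
            gain = len(set(sent.split()) - covered)
            if gain > best_gain:
                best_idx, best_gain = i, gain
        orig_pos, sent = remaining.pop(best_idx)
        covered.update(sent.split())
        selected.append((orig_pos, sent))
    return selected

if __name__ == "__main__":
    # Toy closed text; real inputs would be the full limited text.
    corpus = [
        "in the beginning was the word",
        "the word was with god",
        "all things were made through him",
    ]
    for pos, sent in select_seed_sentences(corpus, budget=2):
        print(pos, sent)
```

The selected sentences would then be translated by humans and used as the few hundred seed lines of low-resource data mentioned above.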
To reach this goal, we argue that in translating a closed text into low-resource languages, generalization to out-of-domain texts is not necessary, but generalization to new languages is. Performance gains come from massive source parallelism with a careful choice of nearby language families, style-consistent corpus-level paraphrases within the same language, and strategic adaptation of existing large pretrained multilingual models, first to the domain and then to the language. These gains make it possible for machine translation systems to collaborate with human translators to expedite the translation process into new, low-resource languages.
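To make the staged-adaptation idea concrete, here is a minimal sketch of the "domain first, then language" ordering using the Hugging Face transformers API. The model checkpoint, hyperparameters, output paths, and dataset arguments are placeholders assumed for illustration; only the two-stage ordering comes from the text above.

```python
"""A minimal sketch of two-stage adaptation: domain first, then language.

Assumptions: the checkpoint name and hyperparameters are placeholders,
and both datasets are pre-tokenized seq2seq datasets compatible with
Hugging Face's Seq2SeqTrainer. Only the stage ordering reflects the text.
"""
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)


def run_stage(model, dataset, output_dir):
    """Fine-tune the model on one dataset (a single adaptation stage)."""
    args = Seq2SeqTrainingArguments(
        output_dir=output_dir,
        num_train_epochs=3,             # placeholder hyperparameters
        per_device_train_batch_size=8,
        learning_rate=3e-5,
    )
    Seq2SeqTrainer(model=model, args=args, train_dataset=dataset).train()
    return model


def two_stage_adaptation(domain_dataset, language_dataset,
                         checkpoint="facebook/mbart-large-50"):
    """Adapt a pretrained multilingual model to the domain, then the language."""
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

    # Stage 1: adapt to the closed text using its rich-resource languages.
    model = run_stage(model, domain_dataset, "out/stage1_domain")
    # Stage 2: adapt to the new language with the few hundred seed lines.
    model = run_stage(model, language_dataset, "out/stage2_language")
    return tokenizer, model
```

The key design choice this mirrors is sequencing: the model first absorbs the closed text's domain from rich-resource languages, so that the scarce low-resource data is spent only on learning the new language.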
Date
- 2023-12-01
Degree Type
- Dissertation
Department
- Language Technologies Institute
Degree Name
- Doctor of Philosophy (PhD)