Improving Optical Character Recognition for Endangered Languages
Much of the text data that exists in many languages is locked away in non-digitized books and documents. This is particularly true for most endangered languages, where little to no machine-readable text is available, but printed materials such as cultural texts, educational books, and notes from linguistic documentation frequently exist. Extracting text from these materials into a machine-readable format is useful for a multitude of reasons: it can aid endangered language preservation and accessibility efforts by archiving the texts and making them searchable for language learners and speakers, and it can enable the development of natural language processing systems for endangered languages.
Optical character recognition (OCR) is typically used to extract text from such documents, but state-of-the-art OCR systems need large amounts of text data and transcribed images to train highly performant models. These resources are often unavailable for endangered languages, and because OCR models are not designed to work well in low-resource scenarios, transcriptions of endangered language documents are far less accurate than those for higher-resourced languages.
In this thesis, we address the task of improving OCR in order to produce high-quality transcriptions of documents that contain text in endangered languages. We use the technique of OCR post-correction, where the goal is to correct errors in existing OCR outputs to increase accuracy. We propose a suite of methods tailored to learning from small amounts of data, and empirically show significant reductions in error rates over existing OCR systems in low-resource settings.
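Recognition quality in this setting is typically quantified with character error rate (CER): the minimum number of character-level edits needed to turn the OCR output into the reference transcription, divided by the reference length. As a concrete reference point, here is a minimal Python sketch of the metric (illustrative only, not code from the thesis):

```python
# Illustrative sketch: character error rate (CER), a standard metric for
# OCR quality. CER = Levenshtein edit distance / reference length.

def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (single-row DP)."""
    dp = list(range(len(hyp) + 1))  # row for the empty prefix of `ref`
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                          # delete ref[i-1]
                dp[j - 1] + 1,                      # insert hyp[j-1]
                prev + (ref[i - 1] != hyp[j - 1]),  # substitute
            )
            prev = cur
    return dp[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edits needed, normalized by reference length."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)

# One wrong character in an eight-character word: CER = 1/8.
print(cer("language", "lunguage"))  # 0.125
```

Post-correction aims to drive this number down by editing the OCR output toward the reference transcription.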
We first present a benchmark dataset for the task of OCR on endangered language texts, containing transcriptions of printed documents in four critically endangered languages, and extensively analyze the shortcomings of existing methods on this dataset, finding that there is considerable room for improvement. Then, we present two models for fixing recognition errors in OCR outputs, targeted to data-scarce settings: (1) a neural OCR post-correction method that leverages high-resource translations and structural biases to train a better-performing model; and (2) a semi-supervised technique that efficiently uses unlabeled scanned images (which are easier to obtain than manually annotated documents) for learning a post-correction model by combining self-training with automatically derived lexica (sketched below).

Additionally, we investigate the real-world impact our proposed models could have on endangered language revitalization by conducting a comprehensive case study on the Kwak’wala language. The case study includes a human-centered evaluation that quantitatively analyzes the utility of post-correction in reducing the manual effort needed for language documentation tasks. Further, to make state-of-the-art OCR technologies (including our post-correction method) more accessible to users who may not have a technical background, we develop a web application that abstracts away software and scripting details and allows users to easily experiment with a variety of OCR tools and train models on new languages. We make the software for the post-correction models proposed in this thesis publicly available, with the hope of enabling model development for new languages and orthographies, and facilitating improvements in text recognition pipelines for low-resource and endangered languages at a global scale.
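To give a concrete picture of the semi-supervised method in (2) above, the sketch below shows the general shape of a self-training loop with a lexicon-based filter. It is schematic only: `train` and `predict` are hypothetical stand-ins for a real post-correction model, and the algorithm developed in the thesis is more involved.

```python
# Schematic sketch of self-training with a lexicon filter; `train` and
# `predict` are hypothetical stand-ins for a real post-correction model.

from typing import Callable

def self_train(
    train: Callable,                 # (pairs) -> model
    labeled: list[tuple[str, str]],  # (ocr_text, gold_transcription)
    unlabeled: list[str],            # OCR outputs without gold text
    lexicon: set[str],               # automatically derived word list
    rounds: int = 3,
):
    """Grow the training set with the model's own filtered predictions."""
    model = train(labeled)  # supervised warm start on the small labeled set
    for _ in range(rounds):
        pseudo = [(x, model.predict(x)) for x in unlabeled]
        # Keep a pseudo-label only if every word in it is attested in the
        # lexicon -- a cheap confidence proxy when no gold transcription
        # exists for the page.
        trusted = [(x, y) for x, y in pseudo
                   if all(w in lexicon for w in y.split())]
        model = train(labeled + trusted)  # retrain on real + pseudo data
    return model
```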
History
Date
- 2022-09-22
Degree Type
- Dissertation
Department
- Language Technologies Institute
Degree Name
- Doctor of Philosophy (PhD)