Carnegie Mellon University

Improving Optical Character Recognition for Endangered Languages

Thesis posted on 2023-01-06, authored by Shruti Rijhwani

Much of the text data that exists in many languages is locked away in non-digitized books and documents. This is particularly true for most endangered languages, where little to no machine-readable text is available, but printed materials such as cultural texts, educational books, and notes from linguistic documentation frequently exist. Extracting the text from these materials into a machine-readable format is useful for a multitude of reasons. It can aid endangered language preservation and accessibility efforts by archiving the texts and making them searchable for language learners and speakers, as well as enable the development of natural language processing systems for endangered languages.

Optical character recognition (OCR) is typically used to extract text from such documents, but state-of-the-art OCR systems need large amounts of text data and transcribed images to train highly performant models. These resources are often unavailable for endangered languages, and because OCR models are not designed to work well in low-resource scenarios, transcriptions of endangered language documents are far less accurate than those of higher-resourced counterparts.
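
The "accuracy" gap described here is conventionally measured as a character error rate (CER): the edit distance between the system output and a reference transcription, normalized by the reference length. Below is a minimal sketch of this metric; the example strings are invented and only meant to show typical OCR confusions.

```python
# Character error rate (CER): Levenshtein edit distance between an OCR output
# and a reference transcription, normalized by the reference length.

def edit_distance(hyp: str, ref: str) -> int:
    """Levenshtein distance via dynamic programming (two-row variant)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            curr.append(min(
                prev[j] + 1,             # delete h
                curr[j - 1] + 1,         # insert r
                prev[j - 1] + (h != r),  # substitute (free if characters match)
            ))
        prev = curr
    return prev[-1]

def cer(hyp: str, ref: str) -> float:
    return edit_distance(hyp, ref) / max(len(ref), 1)

reference = "language documentation"
ocr_output = "1anguage docurnentation"  # common confusions: l -> 1, m -> rn
print(f"CER = {cer(ocr_output, reference):.3f}")  # CER = 0.136
```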

In this thesis, we address the task of improving OCR in order to produce high-quality transcriptions of documents that contain text in endangered languages. We use the technique of OCR post-correction, where the goal is to correct errors in existing OCR outputs to increase accuracy. We propose a suite of methods tailored to learning from small amounts of data, and empirically show significant reductions in error rates over existing OCR systems in low-resource settings.
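
As a toy illustration of the post-correction task's input/output contract (deliberately not the neural method proposed in this thesis), the sketch below rewrites a first-pass OCR output using a fixed table of character confusions; the confusion pairs and example text are hypothetical.

```python
# Toy OCR post-correction: map a first-pass OCR output to a corrected string
# using a hand-written confusion table. This only shows the shape of the task;
# the thesis trains a neural sequence model to predict corrections instead.

CONFUSIONS = {
    "rn": "m",  # 'm' is often recognized as 'r' followed by 'n'
    "1": "l",   # digit one vs. lowercase L
    "0": "o",   # zero vs. lowercase O
}

def post_correct(first_pass: str) -> str:
    corrected = first_pass
    for wrong, right in CONFUSIONS.items():
        # Naive global replacement; a real model conditions on context,
        # since genuine "rn" sequences would be corrupted here.
        corrected = corrected.replace(wrong, right)
    return corrected

print(post_correct("1anguage docurnentati0n"))  # -> language documentation
```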

We first present a benchmark dataset for the task of OCR on endangered language texts, containing transcriptions of printed documents in four critically endangered languages, and extensively analyze the shortcomings of existing methods on this dataset, finding that there is considerable room for improvement. Then, we present two models for fixing recognition errors in OCR outputs, targeted at data-scarce settings: (1) a neural OCR post-correction method that leverages high-resource translations and structural biases to train a better-performing model; and (2) a semi-supervised technique that efficiently uses unlabeled scanned images (which are easier to obtain than manually annotated documents) for learning a post-correction model by combining self-training with automatically derived lexica.

Additionally, we investigate the real-world impact our proposed models could have on endangered language revitalization by conducting a comprehensive case study on the Kwak'wala language. The case study includes a human-centered evaluation that quantitatively analyzes the utility of post-correction in reducing the manual effort needed for language documentation tasks. Further, to make state-of-the-art OCR technologies (including our post-correction method) more accessible to users who may not have a technical background, we develop a web application that abstracts away software and scripting details and allows users to easily experiment with a variety of OCR tools and train models on new languages. We make the software for the post-correction models proposed in this thesis publicly available, with the hope of enabling model development for new languages and orthographies, and facilitating improvements in text recognition pipelines for low-resource and endangered languages at a global scale.
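
Method (2) above can be pictured as a self-training loop with lexicon filtering: a model trained on the small labeled set relabels unlabeled OCR outputs, and only the pseudo-labels that an automatically derived lexicon endorses are added back for retraining. The runnable sketch below stands in a deliberately simple character-substitution "model" for the thesis's neural post-corrector; all data, names, and thresholds are invented for illustration.

```python
# Schematic self-training with lexicon filtering for OCR post-correction.
# The "model" here is a table of character substitutions learned from aligned
# (OCR, gold) pairs; the thesis uses a neural sequence model instead.
import difflib
from collections import Counter

def train(pairs):
    """Learn the most frequent one-to-one character substitutions."""
    counts = Counter()
    for ocr, gold in pairs:
        matcher = difflib.SequenceMatcher(a=ocr, b=gold)
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op == "replace" and (i2 - i1) == (j2 - j1):
                for a, b in zip(ocr[i1:i2], gold[j1:j2]):
                    counts[(a, b)] += 1
    table = {}
    for (a, b), _ in counts.most_common():
        table.setdefault(a, b)  # keep the most frequent target per source char
    return table

def predict(table, text):
    return "".join(table.get(ch, ch) for ch in text)

def lexicon_agreement(text, lexicon):
    tokens = text.split()
    return sum(t in lexicon for t in tokens) / len(tokens) if tokens else 0.0

def self_train(labeled, unlabeled, lexicon, rounds=2, threshold=0.99):
    model = train(labeled)  # initial model from the small annotated set
    for _ in range(rounds):
        pseudo = [(x, predict(model, x)) for x in unlabeled]
        # Keep only pseudo-labels that the derived lexicon largely endorses.
        pseudo = [p for p in pseudo if lexicon_agreement(p[1], lexicon) >= threshold]
        model = train(labeled + pseudo)  # retrain with pseudo-labels added
    return model

labeled = [("1ow", "low"), ("1and", "land")]  # tiny annotated set
unlabeled = ["1anguage 1earner", "va11ey"]    # OCR outputs with no gold text
lexicon = {"low", "land", "language", "learner", "valley"}
model = self_train(labeled, unlabeled, lexicon)
print(predict(model, "1exicon"))  # -> lexicon
```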

History

Date

2022-09-22

Degree Type

  • Dissertation

Department

  • Language Technologies Institute

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Graham Neubig
