Automatic Extraction and Application of Language Descriptions for Under-Resourced Languages
The world's languages are remarkably diverse, each with its own varied and complex systems of word, phrase, and sentence construction, and of vocabulary, to name a few. Understanding these systems is critical not only for communication but also for the design and development of language technologies. Creating a language description that captures the salient points of a language is therefore one of the major endeavours undertaken by language experts, and it forms an indispensable step in language documentation and preservation efforts (Himmelmann, 1998; Moline, 2020). Manually creating such detailed descriptions for several languages, in a form usable by both humans and machines, can be challenging. In this thesis, we therefore explore whether we can automate some of the processes involved in creating language descriptions, and whether we can produce those descriptions in a format usable by both humans and machines.
Thanks to advances in natural language processing (NLP) research, we can automate some local aspects of linguistic analysis, such as identifying the syntactic category of a word (POS tagging) or identifying grammatical relations (dependency parsing). We take advantage of such advances to extract and explain complex linguistic behaviors, covering aspects of morphology, syntax, and lexical semantics that apply to language in general. To achieve this goal, we develop a system, AutoLEX, which automatically extracts these linguistic insights in a human- and machine-readable format for several languages. In the first part of the thesis, we describe this general framework, which takes as input a text corpus of the language of interest and a linguistic question that we are interested in exploring. AutoLEX converts this question into an NLP prediction task and produces a concise description that answers it. As part of this framework, we develop manual and automatic methods to evaluate the resulting descriptions. We further demonstrate the application of these language descriptions in the real-world settings of language analysis and education.
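As a rough illustration of what converting a linguistic question into a prediction task can look like, the sketch below takes the question "does the adjective come before or after the noun?" and derives a default rule plus lexical exceptions from labeled examples. It is not AutoLEX itself: the toy data, the `adj_lemma` feature, and the `describe` helper are all hypothetical, and simple counting stands in for whatever learned model the thesis uses.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: each item is one adjective-noun pair, as it might be
# read off a dependency-parsed corpus, with one feature and the observed order.
EXAMPLES = [
    {"adj_lemma": "grande", "order": "before"},
    {"adj_lemma": "grande", "order": "before"},
    {"adj_lemma": "rojo", "order": "after"},
    {"adj_lemma": "rojo", "order": "after"},
    {"adj_lemma": "nuevo", "order": "after"},
    {"adj_lemma": "nuevo", "order": "before"},
    {"adj_lemma": "nuevo", "order": "after"},
]


def describe(examples, feature, label):
    """Return the corpus-wide majority label and the feature values
    whose own majority label differs from it (the exceptions)."""
    overall = Counter(ex[label] for ex in examples)
    default, _ = overall.most_common(1)[0]

    # Count labels separately for each value of the chosen feature.
    by_value = defaultdict(Counter)
    for ex in examples:
        by_value[ex[feature]][ex[label]] += 1

    exceptions = {}
    for value, counts in by_value.items():
        majority, _ = counts.most_common(1)[0]
        if majority != default:
            exceptions[value] = majority
    return default, exceptions


default, exceptions = describe(EXAMPLES, "adj_lemma", "order")
print(f"Default order: adjective {default} the noun")
for lemma, order in exceptions.items():
    print(f"Exception: '{lemma}' usually comes {order} the noun")
```

The output pairs a general rule with its exceptions, which is one plausible shape for a description that both a learner and a program could consume.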
In the second part of the thesis, we describe how to improve the NLP building blocks that inform AutoLEX, particularly for under-resourced languages. Most state-of-the-art methods used in these building blocks (e.g., for local linguistic analysis such as POS tagging) require an abundance of labeled data, which is often not readily available for many languages. Therefore, we focus on improving these methods for such under-resourced languages. Specifically, we explore: 1) Cross-lingual Transfer Learning (CLTL) (Zoph et al., 2016), which leverages existing labeled data and models from high-resource languages, and 2) Active Learning (AL) (Lewis and Gale, 1994; Settles and Craven, 2008), which helps train models by collecting labeled data in the under-resourced language while minimizing human annotation effort. We propose combining both in a unified framework where CLTL helps improve the performance of the AL learner.
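One way to picture the combination of CLTL and AL is a selection step in which a model transferred from a high-resource language scores unlabeled tokens, and the least certain ones are sent to an annotator. The sketch below assumes entropy-based uncertainty sampling, which is a common AL strategy but is not stated in this abstract as the thesis's own selection criterion. The `PREDICTIONS` values and token names are invented for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a tag distribution given as a dict."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0)


def select_for_annotation(predictions, budget):
    """Pick the `budget` tokens whose predicted tag distribution is most
    uncertain, so annotation effort goes where the model is least sure."""
    ranked = sorted(
        predictions, key=lambda tok: entropy(predictions[tok]), reverse=True
    )
    return ranked[:budget]


# Hypothetical POS-tag distributions, as a model transferred from a related
# high-resource language might produce for unlabeled target-language tokens.
PREDICTIONS = {
    "token_a": {"NOUN": 0.98, "VERB": 0.01, "ADJ": 0.01},
    "token_b": {"NOUN": 0.40, "VERB": 0.35, "ADJ": 0.25},
    "token_c": {"NOUN": 0.70, "VERB": 0.20, "ADJ": 0.10},
}

chosen = select_for_annotation(PREDICTIONS, budget=2)
print(chosen)
```

In a full loop, the newly annotated tokens would be added to the training set, the model retrained, and the selection repeated until the annotation budget is spent.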
- Language Technologies Institute
- Doctor of Philosophy (PhD)