Learning Computational Models of Non-Standard Language
Nonstandard language such as novel words or creative spellings of existing ones often occurs in natural text corpora, posing significant challenges for natural language processing (NLP) models. While humans can successfully infer the meaning communicated in such nonstandard ways, NLP models largely discard linguistic innovation as noise, ignoring its fundamentally nonrandom nature and losing valuable context. In this thesis, we focus on computational modeling of such creative phenomena, aiming to both improve the automatic processing of nonstandardized text data and to learn more about the linguistic and cognitive factors that allow humans to produce and understand novel linguistic items.
We present empirical studies of several phenomena under the umbrella of nonstandard language, characterized in terms of different linguistic units (orthographic, morphological, or lexical) and considered at different levels of granularity (from individual users to entire dialects or languages). First, we show how idiosyncratic spelling preferences reveal information about the user, with an application to the bibliographic task of identifying typesetters of historical printed documents. Second, we discuss the common patterns in user-specific orthographies and demonstrate that incorporating these patterns helps with unsupervised conversion of idiosyncratically romanized text into the conventional orthography of the language. Third, we consider word emergence in a dialect or language as a whole and, in two diachronic corpus studies, model the language-internal and language-external factors that drive it. Finally, we look at how the continuous emergence of novel words is reconciled with the existing system of morphological rules, focusing on generalization to unseen lemmas in morphological inflection in several languages.
History
Date
- 2022-09-26

Degree Type
- Dissertation
Department
- Language Technologies Institute
Degree Name
- Doctor of Philosophy (PhD)