Carnegie Mellon University

Learning Computational Models of Non-Standard Language

thesis
posted on 2023-01-06, 21:43 authored by Mariia Ryskina

Non-standard language, such as novel words or creative spellings of existing ones, often occurs in natural text corpora, posing significant challenges for natural language processing (NLP) models. While humans can successfully infer the meaning communicated in such non-standard ways, NLP models largely discard linguistic innovation as noise, ignoring its fundamentally non-random nature and losing valuable context. In this thesis, we focus on computational modeling of such creative phenomena, aiming both to improve the automatic processing of non-standardized text data and to learn more about the linguistic and cognitive factors that allow humans to produce and understand novel linguistic items.

We present empirical studies of several phenomena under the umbrella of non-standard language, characterized in terms of different linguistic units (orthographic, morphological, or lexical) and considered at different levels of granularity (from individual users to entire dialects or languages). First, we show how idiosyncratic spelling preferences reveal information about the user, with an application to the bibliographic task of identifying typesetters of historical printed documents. Second, we discuss the common patterns in user-specific orthographies and demonstrate that incorporating these patterns helps with unsupervised conversion of idiosyncratically romanized text into the conventional orthography of the language. Third, we consider word emergence in a dialect or language as a whole and, in two diachronic corpus studies, model the language-internal and language-external factors that drive it. Finally, we look at how continuous emergence of novel words is reconciled with the existing system of morphological rules, focusing on generalization to unseen lemmas in morphological inflection in several languages.

History

Date

2022-09-26

Degree Type

  • Dissertation

Department

  • Language Technologies Institute

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Matthew R. Gormley, Eduard Hovy, Taylor Berg-Kirkpatrick
