Learning Computational Models of Non-Standard Language
Nonstandard language such as novel words or creative spellings of existing ones often occurs in natural text corpora, posing significant challenges for natural language processing (NLP) models. While humans can successfully infer the meaning communicated in such nonstandard ways, NLP models largely discard linguistic innovation as noise, ignoring its fundamentally nonrandom nature and losing valuable context. In this thesis, we focus on computational modeling of such creative phenomena, aiming to both improve the automatic processing of nonstandardized text data and to learn more about the linguistic and cognitive factors that allow humans to produce and understand novel linguistic items.
We present empirical studies of several phenomena under the umbrella of nonstandard language, characterized in terms of different linguistic units (orthographic, morphological, or lexical) and considered at different levels of granularity (from individual users to entire dialects or languages). First, we show how idiosyncratic spelling preferences reveal information about the user, with an application to the bibliographic task of identifying typesetters of historical printed documents. Second, we discuss the common patterns in user-specific orthographies and demonstrate that incorporating these patterns helps with unsupervised conversion of idiosyncratically romanized text into the conventional orthography of the language. Third, we consider word emergence in a dialect or language as a whole and, in two diachronic corpus studies, model the language-internal and language-external factors that drive it. Finally, we look at how the continuous emergence of novel words is reconciled with the existing system of morphological rules, focusing on generalization to unseen lemmas in morphological inflection in several languages.
History
Date
- 2022-09-26

Degree Type
- Dissertation
Department
- Language Technologies Institute
Degree Name
- Doctor of Philosophy (PhD)