Carnegie Mellon University

Learning Internal Structure of Words with Factored Generative Models

thesis
posted on 2020-12-04, 17:10 authored by Jonghyuk Park
While it has recently been the dominant trend in NLP to treat words as flat vectors, we are well aware that words have intelligible internal structure. In this thesis, I aim to extract more structured, interpretable representations of words that reflect such inner workings, by means of generative modeling. Such representations are discovered in attempts to model the phenomenon of inflectional morphology, by which words are systematically categorized into different grammatical classes and exhibit form changes according to those classes. Approaches to studying the internal structure of words can be categorized into one of three families of morphological theories: morpheme-based item-and-arrangement (IA) models, lexeme-based item-and-process (IP) models, and word-based word-and-paradigm (WP) models. Here, I draw attention to the fact that there have been relatively few attempts to computationally learn IP morphologies, especially in unsupervised settings. As the modern NLP community is equipped with more powerful tools to model, or approximate, complicated distributions, we are now in a good position to attempt extracting morphological models based on functional processes. In this work, I employ two different techniques to approach the problem of learning IP morphologies by generative modeling. First, I implement hidden Markov models with factored latent states, which emit continuous word embeddings, as a non-neural baseline. Then, I implement neural autoencoders that extend the factored HMMs via adversarial training, lifting a number of simplifying assumptions made by the baseline. After a series of extensive evaluation experiments, I find that although modeling IP morphologies is an attainable goal when annotated data are available, purely unsupervised learning of IP morphologies remains a significant challenge. Specifically, the experiments confirm that unrestricted training under the common likelihood-based objective tends to drift away from human linguistic knowledge when there is no means of regularizing the training of such highly complicated models.
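The factored-HMM baseline described in the abstract can be illustrated with a toy sketch: a hidden Markov model whose joint latent state factors into a lexeme-like class and an inflection-like class, emitting continuous word embeddings through Gaussian densities. The state-space sizes, the Kronecker-product transition structure, and the identity emission covariances below are illustrative assumptions for the sketch, not the thesis's actual parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: L lexeme classes, M inflection classes, D-dim embeddings.
L, M, D = 3, 2, 4
S = L * M  # flattened joint state space

def random_stochastic(n):
    """A random row-stochastic matrix of size (n, n)."""
    A = rng.random((n, n))
    return A / A.sum(axis=1, keepdims=True)

# Factored transitions: if the two factors transition independently, the joint
# transition matrix is the Kronecker product of the factor matrices (each row
# of the product still sums to 1).
A_lex = random_stochastic(L)
A_inf = random_stochastic(M)
A = np.kron(A_lex, A_inf)          # (S, S) joint transition matrix
pi = np.full(S, 1.0 / S)           # uniform initial state distribution

# Gaussian emissions over continuous word embeddings, one mean per joint state,
# with identity covariance for simplicity.
means = rng.normal(size=(S, D))

def log_gauss(x, mu):
    """Log N(x; mu, I) for each row of mu; returns shape (S,)."""
    diff = x - mu
    return -0.5 * (D * np.log(2 * np.pi) + np.sum(diff * diff, axis=-1))

def forward_loglik(X):
    """Log-likelihood of an embedding sequence X (T, D) via the forward algorithm."""
    log_alpha = np.log(pi) + log_gauss(X[0], means)
    for t in range(1, X.shape[0]):
        m = log_alpha.max()  # stabilized log-sum-exp over previous states
        log_alpha = np.log(np.exp(log_alpha - m) @ A) + m + log_gauss(X[t], means)
    m = log_alpha.max()
    return m + np.log(np.exp(log_alpha - m).sum())

X = rng.normal(size=(5, D))  # a fake sequence of 5 word embeddings
print(forward_loglik(X))     # a finite (negative) log-likelihood
```

In this sketch the factorization shows up only in the Kronecker-structured transitions; the thesis's models would additionally need to learn the emission and transition parameters (e.g. by EM or gradient-based training) rather than fixing them randomly.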

Funding

LTI Research Fellowship (Paid by DARPA)

History

Date

2020-10-29

Degree Type

  • Master's Thesis

Department

  • Language Technologies Institute

Degree Name

  • Master of Language Technologies

Advisor(s)

David R. Mortensen
