Carnegie Mellon University

Learning Internal Structure of Words with Factored Generative Models

thesis
posted on 2020-12-04, 17:10 authored by Jonghyuk Park
While it has recently been the dominant trend in NLP to treat words as flat vectors, we are well aware that words have intelligible internal structure. In this thesis, I aim to extract more structured, interpretable representations of words that reflect such inner workings, by means of generative modeling. Such representations are discovered in attempts to model the phenomenon of inflectional morphology, by which words are systematically categorized into different grammatical classes and exhibit form changes according to those classes. Approaches to studying the internal structure of words can be categorized into one of three families of morphological theories: morpheme-based item-and-arrangement (IA) models, lexeme-based item-and-process (IP) models, and word-based word-and-paradigm (WP) models. Here, I draw attention to the fact that there have been relatively few attempts to computationally learn IP morphologies, especially in unsupervised settings. As the modern NLP community is equipped with more powerful tools to model, or approximate, complicated distributions, we are now in a good position to attempt extracting morphological models based on functional processes. In this work, I employ two different techniques to approach the problem of learning IP morphologies by generative modeling. First, I implement hidden Markov models with factored latent states, which emit continuous word embeddings, as a non-neural baseline. Then, I implement neural autoencoders that extend the factored HMMs via adversarial training, lifting a number of simplifying assumptions made by the baseline. After a series of extensive evaluation experiments, I find that although modeling IP morphologies is an attainable goal when annotated data are available, purely unsupervised learning of IP morphologies remains a significant challenge. Specifically, the experiments confirm that unrestricted training under the common likelihood-based objective tends to drift away from human linguistic knowledge when there is no means of regularizing the training of such highly complicated models.
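The factored-HMM baseline described in the abstract can be illustrated with a toy sketch: a hidden Markov model whose joint latent state factors into a lexeme-like class and an inflection-like class, emitting continuous word embeddings through Gaussian densities. The state-space sizes, the Kronecker-product transition structure, and the identity emission covariances below are illustrative assumptions for the sketch, not the thesis's actual parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: L lexeme classes, M inflection classes, D-dim embeddings.
L, M, D = 3, 2, 4
S = L * M  # flattened joint state space

def random_stochastic(n):
    """A random row-stochastic matrix of size (n, n)."""
    A = rng.random((n, n))
    return A / A.sum(axis=1, keepdims=True)

# Factored transitions: if the two factors transition independently, the joint
# transition matrix is the Kronecker product of the factor matrices (each row
# of the product still sums to 1).
A_lex = random_stochastic(L)
A_inf = random_stochastic(M)
A = np.kron(A_lex, A_inf)          # (S, S) joint transition matrix
pi = np.full(S, 1.0 / S)           # uniform initial state distribution

# Gaussian emissions over continuous word embeddings, one mean per joint state,
# with identity covariance for simplicity.
means = rng.normal(size=(S, D))

def log_gauss(x, mu):
    """Log N(x; mu, I) for each row of mu; returns shape (S,)."""
    diff = x - mu
    return -0.5 * (D * np.log(2 * np.pi) + np.sum(diff * diff, axis=-1))

def forward_loglik(X):
    """Log-likelihood of an embedding sequence X (T, D) via the forward algorithm."""
    log_alpha = np.log(pi) + log_gauss(X[0], means)
    for t in range(1, X.shape[0]):
        m = log_alpha.max()  # stabilized log-sum-exp over previous states
        log_alpha = np.log(np.exp(log_alpha - m) @ A) + m + log_gauss(X[t], means)
    m = log_alpha.max()
    return m + np.log(np.exp(log_alpha - m).sum())

X = rng.normal(size=(5, D))  # a fake sequence of 5 word embeddings
print(forward_loglik(X))     # a finite (negative) log-likelihood
```

In this sketch the factorization shows up only in the Kronecker-structured transitions; the thesis's models would additionally need to learn the emission and transition parameters (e.g. by EM or gradient-based training) rather than fixing them randomly.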

Funding

LTI Research Fellowship (Paid by DARPA)

History

Date

2020-10-29

Degree Type

  • Master's Thesis

Department

  • Language Technologies Institute

Degree Name

  • Master of Language Technologies

Advisor(s)

David R. Mortensen
