Posted on 2020-12-04, 17:10, authored by Jonghyuk Park
While it has recently been the dominant trend in NLP to treat words as flat vectors, we are well aware that words have intelligible internal structures. In this thesis, I aim to extract more structured, interpretable representations of words that reflect such inner workings, by means of generative modeling. Such representations are discovered in attempts to model the phenomenon of inflectional morphology, by which words are systematically categorized into different grammatical classes and manifest form changes according to those classes. Approaches to studying the internal structure of words fall into one of three families of morphological theories: morpheme-based item-and-arrangement (IA) models, lexeme-based item-and-process (IP) models, and word-based word-and-paradigm (WP) models. Here, I pay attention to the fact that there have been relatively few attempts to computationally learn IP morphologies, especially in unsupervised settings.
As the modern NLP community is equipped with ever more powerful tools to model, or at least approximate, complicated distributions, we are now in a good position to attempt extracting morphological models based on functional processes. In this work, I employ two different techniques to approach the problem of learning IP morphologies by generative modeling. First, I implement hidden Markov models with factored latent states, which emit continuous word embeddings, as a non-neural baseline.
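To make the baseline concrete, the following is a minimal sketch of a factored-state HMM over continuous embeddings, assuming hmmlearn's GaussianHMM. The factored (lexical class, inflection slot) state space is flattened into a single state index, since hmmlearn does not itself enforce a factored transition structure; all sizes and the random data are illustrative placeholders, not the thesis's actual configuration.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

N_CLASSES = 10   # hypothetical number of lexical classes
N_SLOTS = 6      # hypothetical number of inflection slots
EMB_DIM = 100    # word-embedding dimensionality

# Stand-ins for real data: token embeddings for a small corpus, plus
# sentence lengths so hmmlearn knows where each sequence ends.
embeddings = np.random.randn(500, EMB_DIM)
lengths = [25] * 20

# The factored state (class, slot) is flattened into one categorical state;
# each flattened state emits embeddings from its own Gaussian.
model = GaussianHMM(n_components=N_CLASSES * N_SLOTS,
                    covariance_type="diag", n_iter=50)
model.fit(embeddings, lengths)

# Decode, then unflatten each state back into its (class, slot) factors.
states = model.predict(embeddings, lengths)
classes, slots = np.divmod(states, N_SLOTS)
```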
Then, I implement neural autoencoders that extend the factored HMMs via adversarial training, lifting a number of simplifying assumptions made by the baseline; a sketch of this adversarial setup appears at the end of this abstract. After a series of extensive evaluation experiments, I find that although modeling IP morphologies is an attainable goal when we have access to annotated data, purely unsupervised learning of IP morphologies remains a significant challenge. Specifically, the experiments confirm that unrestricted training under the common likelihood-based objective is likely to drift away from human linguistic knowledge when there is no means to regularize the training of such highly complicated models.
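As a point of reference for the adversarial setup described above, here is a minimal PyTorch sketch of an adversarial autoencoder over word embeddings: an encoder splits each embedding into a continuous lexeme code and a soft categorical inflection code, a decoder reconstructs the embedding, and a discriminator pushes the lexeme code toward a Gaussian prior. The architecture, the prior, and the loss weighting are illustrative assumptions rather than the thesis's exact model.

```python
import torch
import torch.nn as nn

EMB_DIM, LEX_DIM, N_SLOTS = 100, 32, 6  # illustrative sizes

# Encoder emits a lexeme code plus inflection-slot logits; the decoder
# reconstructs the embedding from the concatenated factored code.
encoder = nn.Sequential(nn.Linear(EMB_DIM, 128), nn.ReLU(),
                        nn.Linear(128, LEX_DIM + N_SLOTS))
decoder = nn.Sequential(nn.Linear(LEX_DIM + N_SLOTS, 128), nn.ReLU(),
                        nn.Linear(128, EMB_DIM))
# Discriminator distinguishes encoder lexeme codes from prior samples.
discriminator = nn.Sequential(nn.Linear(LEX_DIM, 64), nn.ReLU(),
                              nn.Linear(64, 1))

opt_ae = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def step(x):
    """One training step on a batch x of word embeddings, shape (batch, EMB_DIM)."""
    code = encoder(x)
    lexeme, slot_logits = code[:, :LEX_DIM], code[:, LEX_DIM:]
    slot = torch.softmax(slot_logits, dim=-1)          # soft inflection assignment
    recon = decoder(torch.cat([lexeme, slot], dim=-1))

    # Autoencoder update: reconstruction loss plus an adversarial term that
    # rewards lexeme codes the discriminator judges as prior samples.
    adv = bce(discriminator(lexeme), torch.ones(x.size(0), 1))
    loss_ae = nn.functional.mse_loss(recon, x) + 0.1 * adv
    opt_ae.zero_grad(); loss_ae.backward(); opt_ae.step()

    # Discriminator update: prior samples are "real", encoder codes are "fake".
    prior = torch.randn(x.size(0), LEX_DIM)
    loss_d = bce(discriminator(prior), torch.ones(x.size(0), 1)) \
           + bce(discriminator(lexeme.detach()), torch.zeros(x.size(0), 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Example: one step on a random batch standing in for real embeddings.
step(torch.randn(64, EMB_DIM))
```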