Carnegie Mellon University

Linguistic Knowledge for Neural Language Generation and Machine Translation

posted on 2022-12-02, 20:33, authored by Austin Matthews

Recurrent neural networks (RNNs) are exceptionally good models of distributions over natural language sentences, and they are deployed in a wide range of applications that require the generation of natural language outputs. However, RNNs are general-purpose function learners that, given sufficient capacity, are capable of representing any distribution, whereas the space of possible natural languages is narrowly constrained. Linguistic theory has been concerned with characterizing these constraints, with a particular eye toward explaining the uniformity with which children acquire their first languages, despite receiving relatively little linguistic input. This thesis uses insights from linguistic theory to inform the neural architectures and generation processes used to model natural language, seeking models that make more effective use of limited amounts of training data. Since linguistic theories are incomplete, a central goal is developing models that are able to exploit explicit linguistic knowledge while still retaining the generality and flexibility of the neural network models they augment. 

This thesis examines two linguistic domains: word formation and sentence structure. First, in the word formation domain, we introduce a language model that captures subword-level word formation using linguistic knowledge of morphological processes, encoded in finite-state analyzers hand-crafted by linguistic experts. Our model operates at several levels of granularity, using the raw word, character, and morpheme levels both to encode and condition on previous words and to construct its predicted next word. As a result, it is fully open-vocabulary, capable of producing any token admitted by a language's alphabet. These properties make it well suited to modelling languages with potentially unbounded vocabularies, such as Turkish and Finnish. 
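To make the multi-granularity idea concrete, here is a minimal, illustrative Python sketch, not the thesis implementation: a toy unigram model that mixes a word-level distribution with character-level generation over morpheme segments. The analyze_morphemes function is a hypothetical stand-in for a hand-crafted finite-state analyzer, and the suffix list, smoothing, and mixture weight are assumptions made purely for illustration.

import math
from collections import Counter

def analyze_morphemes(word):
    # Hypothetical stand-in for a hand-crafted finite-state analyzer:
    # naively strip one of a few fixed suffixes, purely for illustration.
    for suffix in ("ler", "lar", "de", "da"):
        if word.endswith(suffix) and len(word) > len(suffix):
            return [word[:-len(suffix)], suffix]
    return [word]

class OpenVocabToyLM:
    # Toy open-vocabulary model: a word-level unigram distribution mixed with
    # character-level generation over morpheme segments, so any string over
    # the training alphabet receives non-zero probability.
    def __init__(self, corpus, alpha=0.7):
        words = corpus.split()
        self.word_counts = Counter(words)
        self.char_counts = Counter("".join(words))
        self.total_words = sum(self.word_counts.values())
        self.total_chars = sum(self.char_counts.values())
        self.alpha = alpha  # weight on the word-level component

    def char_logprob(self, segment):
        # Character-level generation with add-one smoothing, plus a crude
        # stop probability of 1 / (len(segment) + 1).
        logp = -math.log(len(segment) + 1)
        for c in segment:
            p = (self.char_counts.get(c, 0) + 1) / (self.total_chars + len(self.char_counts) + 1)
            logp += math.log(p)
        return logp

    def word_logprob(self, word):
        p_word = self.word_counts.get(word, 0) / max(self.total_words, 1)
        p_subword = math.exp(sum(self.char_logprob(m) for m in analyze_morphemes(word)))
        return math.log(self.alpha * p_word + (1 - self.alpha) * p_subword)

lm = OpenVocabToyLM("evde kitap okudum evlerde kitaplar var")
print(lm.word_logprob("kitaplar"))    # in-vocabulary word
print(lm.word_logprob("kitaplarda"))  # unseen word still receives probability

The key property illustrated is open-vocabulary coverage: even a word never seen in training receives non-zero probability through the subword component.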

Second, in the sentence structure domain, we present a pair of dependency-based language models, leveraging syntactic theories that derive sequences of words as the outputs of hierarchical branching processes. Our models construct syntax trees either top-down or bottom-up, jointly learning language modelling and parsing. We find that these dependency-based models make good parsers, but that dependencies are less effective than phrase-structure trees for modelling language. 
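As a hedged illustration of how a generative dependency model can jointly score a sentence and its tree, the following Python sketch has each head generate its left and right dependents until a STOP decision. The GEN tables, the example sentence, and the first-order factorization are placeholders chosen for this sketch; the thesis models parameterize such decisions with recurrent networks over the partial tree rather than with lookup tables.

import math

STOP = "<STOP>"

# Hypothetical conditional tables P(next dependent | head, direction).
GEN = {
    ("<ROOT>", "left"):  {STOP: 1.0},
    ("<ROOT>", "right"): {"ate": 0.6, STOP: 0.4},
    ("ate", "left"):     {"she": 0.5, STOP: 0.5},
    ("ate", "right"):    {"apples": 0.4, STOP: 0.6},
    ("she", "left"):     {STOP: 1.0},
    ("she", "right"):    {STOP: 1.0},
    ("apples", "left"):  {STOP: 1.0},
    ("apples", "right"): {STOP: 1.0},
}

def subtree_logprob(head, tree):
    # Joint log-probability of the subtree rooted at `head`: generate each
    # dependent in order, recurse into it, then generate STOP in each direction.
    logp = 0.0
    for direction in ("left", "right"):
        dist = GEN[(head, direction)]
        for dep in tree.get((head, direction), []):
            logp += math.log(dist[dep]) + subtree_logprob(dep, tree)
        logp += math.log(dist[STOP])
    return logp

# Dependency tree for "she ate apples": she <- ate -> apples, headed by the root.
tree = {
    ("<ROOT>", "right"): ["ate"],
    ("ate", "left"):     ["she"],
    ("ate", "right"):    ["apples"],
}
print(subtree_logprob("<ROOT>", tree))  # log P(sentence, tree) under the toy model

Because the same generative process yields both the words and the tree, a single model of this form can be read as a language model (by marginalizing over trees) or as a parser (by maximizing over trees).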

Finally, again looking at sentence structure, we investigate the application of syntax to conditional language modelling, where data scarcity exacerbates the need for sample-efficient models. We develop a fully neural tree-to-tree translation system that leverages syntax in both the source and target languages. We then ablate the model, demonstrating the effects of a source-side syntax-based encoder and a target-side syntax-based decoder separately. We find that source-side syntax holds good promise, and show that inference under neural models is trapped in a local optimum in which biased models perversely synergize with poor inference procedures. As a consequence of this interaction, improvements in modelling and in decoding algorithms do not necessarily lead to improved quality metrics. 
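To illustrate what a source-side syntax-based encoder can look like in the simplest possible form, the sketch below recursively composes a binarized source tree into a single vector that a decoder could condition on. The dimensionality, the tanh composition, and the random parameters are assumptions made here for illustration and are not the architecture used in the thesis.

import numpy as np

rng = np.random.default_rng(0)
DIM = 8
W = rng.normal(scale=0.1, size=(DIM, 2 * DIM))  # binary composition weights
b = np.zeros(DIM)
EMBED = {}  # word embeddings, created lazily

def embed(word):
    if word not in EMBED:
        EMBED[word] = rng.normal(scale=0.1, size=DIM)
    return EMBED[word]

def encode(tree):
    # Recursively encode a binarized source tree given as nested
    # (left, right) tuples with words at the leaves.
    if isinstance(tree, str):
        return embed(tree)
    left, right = tree
    children = np.concatenate([encode(left), encode(right)])
    return np.tanh(W @ children + b)  # composed phrase representation

source_tree = ("she", ("ate", "apples"))  # toy binarized source-side tree
context = encode(source_tree)
print(context.shape)  # (8,): a fixed-size summary for the decoder to condition on

A target-side syntax-based decoder would, analogously, emit tree-building actions rather than a flat token sequence.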

These models demonstrate the effectiveness of hybrid techniques that marry the expressive power of neural networks with explicit linguistic structure derived from human analyses. This synergy allows models to be both more sample-efficient and more interpretable. 

History

Date

2019-11-12

Degree Type

  • Dissertation

Department

  • Language Technologies Institute

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Chris Dyer