Carnegie Mellon University

Linguistic Knowledge for Neural Language Generation and Machine Translation

posted on 2022-12-02, 20:33, authored by Austin Matthews

Recurrent neural networks (RNNs) are exceptionally good models of distributions over natural language sentences, and they are deployed in a wide range of applications that require the generation of natural language outputs. However, RNNs are general-purpose function learners that, given sufficient capacity, are capable of representing any distribution, whereas the space of possible natural languages is narrowly constrained. Linguistic theory has been concerned with characterizing these constraints, with a particular eye toward explaining the uniformity with which children acquire their first languages, despite receiving relatively little linguistic input. This thesis uses insights from linguistic theory to inform the neural architectures and generation processes used to model natural language, seeking models that make more effective use of limited amounts of training data. Since linguistic theories are incomplete, a central goal is developing models that are able to exploit explicit linguistic knowledge while still retaining the generality and flexibility of the neural network models they augment. 

This thesis examines two linguistic domains: word formation and sentence structure. First, in the word formation domain, we introduce a language model that captures subword-level word formation using linguistic knowledge of morphological processes, encoded in finite-state analyzers hand-crafted by linguistic experts. Our model operates at several levels of granularity, using the raw word, character, and morpheme levels both to encode and condition on previous words and to construct its predicted next word. As a result, it is fully open-vocabulary, capable of producing any token admitted by a language's alphabet. These properties make it well suited to modelling languages with potentially unbounded vocabularies, such as Turkish and Finnish. 
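To make the multi-granularity idea concrete, here is a minimal, illustrative Python sketch, not the thesis implementation: a toy unigram model that mixes a word-level distribution with character-level generation over morpheme segments. The analyze_morphemes function is a hypothetical stand-in for a hand-crafted finite-state analyzer, and the suffix list, smoothing, and mixture weight are assumptions made purely for illustration.

import math
from collections import Counter

def analyze_morphemes(word):
    # Hypothetical stand-in for a hand-crafted finite-state analyzer:
    # naively strip one of a few fixed suffixes, purely for illustration.
    for suffix in ("ler", "lar", "de", "da"):
        if word.endswith(suffix) and len(word) > len(suffix):
            return [word[:-len(suffix)], suffix]
    return [word]

class OpenVocabToyLM:
    # Toy open-vocabulary model: a word-level unigram distribution mixed with
    # character-level generation over morpheme segments, so any string over
    # the training alphabet receives non-zero probability.
    def __init__(self, corpus, alpha=0.7):
        words = corpus.split()
        self.word_counts = Counter(words)
        self.char_counts = Counter("".join(words))
        self.total_words = sum(self.word_counts.values())
        self.total_chars = sum(self.char_counts.values())
        self.alpha = alpha  # weight on the word-level component

    def char_logprob(self, segment):
        # Character-level generation with add-one smoothing, plus a crude
        # stop probability of 1 / (len(segment) + 1).
        logp = -math.log(len(segment) + 1)
        for c in segment:
            p = (self.char_counts.get(c, 0) + 1) / (self.total_chars + len(self.char_counts) + 1)
            logp += math.log(p)
        return logp

    def word_logprob(self, word):
        p_word = self.word_counts.get(word, 0) / max(self.total_words, 1)
        p_subword = math.exp(sum(self.char_logprob(m) for m in analyze_morphemes(word)))
        return math.log(self.alpha * p_word + (1 - self.alpha) * p_subword)

lm = OpenVocabToyLM("evde kitap okudum evlerde kitaplar var")
print(lm.word_logprob("kitaplar"))    # in-vocabulary word
print(lm.word_logprob("kitaplarda"))  # unseen word still receives probability

The key property illustrated is open-vocabulary coverage: even a word never seen in training receives non-zero probability through the subword component.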

Second, in the sentence structure domain, we present a pair of dependency-based language models, leveraging syntactic theories that derive sequences of words as the outputs of hierarchical branching processes. Our models construct syntax trees either top-down or bottom-up, jointly learning language modelling and parsing. We find that these dependency-based models make good parsers, but that dependencies are less effective than phrase-structure trees for modelling language. 
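As a hedged illustration of how a generative dependency model can jointly score a sentence and its tree, the following Python sketch has each head generate its left and right dependents until a STOP decision. The GEN tables, the example sentence, and the first-order factorization are placeholders chosen for this sketch; the thesis models parameterize such decisions with recurrent networks over the partial tree rather than with lookup tables.

import math

STOP = "<STOP>"

# Hypothetical conditional tables P(next dependent | head, direction).
GEN = {
    ("<ROOT>", "left"):  {STOP: 1.0},
    ("<ROOT>", "right"): {"ate": 0.6, STOP: 0.4},
    ("ate", "left"):     {"she": 0.5, STOP: 0.5},
    ("ate", "right"):    {"apples": 0.4, STOP: 0.6},
    ("she", "left"):     {STOP: 1.0},
    ("she", "right"):    {STOP: 1.0},
    ("apples", "left"):  {STOP: 1.0},
    ("apples", "right"): {STOP: 1.0},
}

def subtree_logprob(head, tree):
    # Joint log-probability of the subtree rooted at `head`: generate each
    # dependent in order, recurse into it, then generate STOP in each direction.
    logp = 0.0
    for direction in ("left", "right"):
        dist = GEN[(head, direction)]
        for dep in tree.get((head, direction), []):
            logp += math.log(dist[dep]) + subtree_logprob(dep, tree)
        logp += math.log(dist[STOP])
    return logp

# Dependency tree for "she ate apples": she <- ate -> apples, headed by the root.
tree = {
    ("<ROOT>", "right"): ["ate"],
    ("ate", "left"):     ["she"],
    ("ate", "right"):    ["apples"],
}
print(subtree_logprob("<ROOT>", tree))  # log P(sentence, tree) under the toy model

Because the same generative process yields both the words and the tree, a single model of this form can be read as a language model (by marginalizing over trees) or as a parser (by maximizing over trees).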

Finally, again looking at sentence structure, we investigate the application of syntax to conditional language modelling, where data scarcity exacerbates the need for sample-efficient models. We develop a fully neural tree-to-tree translation system that leverages syntax in both the source and target languages. We then ablate the model, demonstrating the effects of a source-side syntax-based encoder and a target-side syntax-based decoder separately. We find that source-side syntax holds good promise, and show that inference under neural models is trapped in a local optimum in which biased models perversely synergize with poor inference procedures. As a consequence of this interaction, improvements in modelling and in decoding algorithms do not necessarily lead to improved quality metrics. 
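To illustrate what a source-side syntax-based encoder can look like in the simplest possible form, the sketch below recursively composes a binarized source tree into a single vector that a decoder could condition on. The dimensionality, the tanh composition, and the random parameters are assumptions made here for illustration and are not the architecture used in the thesis.

import numpy as np

rng = np.random.default_rng(0)
DIM = 8
W = rng.normal(scale=0.1, size=(DIM, 2 * DIM))  # binary composition weights
b = np.zeros(DIM)
EMBED = {}  # word embeddings, created lazily

def embed(word):
    if word not in EMBED:
        EMBED[word] = rng.normal(scale=0.1, size=DIM)
    return EMBED[word]

def encode(tree):
    # Recursively encode a binarized source tree given as nested
    # (left, right) tuples with words at the leaves.
    if isinstance(tree, str):
        return embed(tree)
    left, right = tree
    children = np.concatenate([encode(left), encode(right)])
    return np.tanh(W @ children + b)  # composed phrase representation

source_tree = ("she", ("ate", "apples"))  # toy binarized source-side tree
context = encode(source_tree)
print(context.shape)  # (8,): a fixed-size summary for the decoder to condition on

A target-side syntax-based decoder would, analogously, emit tree-building actions rather than a flat token sequence.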

These models demonstrate the effectiveness of hybrid techniques that marry the expressive power of neural networks with explicit linguistic structure derived from human analyses. This synergy allows models to be both more sample-efficient and more interpretable. 

History

Date

2019-11-12

Degree Type

  • Dissertation

Department

  • Language Technologies Institute

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Chris Dyer