Neural sequence models are typically parametrized as locally normalized autoregressive models. These models simplify generation by producing tokens one at a time in a predetermined order, guided at each step by a probability distribution over the vocabulary. Although they have achieved impressive performance on language processing and generation tasks such as machine translation, dialog response generation, and speech processing and synthesis, this class of models is also known to exhibit degenerate behavior during optimization and decoding. In this thesis, I characterize two limitations of locally normalized models, exposure bias and label bias, both of which are pernicious inductive biases of autoregressive models that impede efficient training of such deep neural models. This dissertation proposes solutions to ameliorate these issues in order to train more powerful and well-behaved probabilistic sequence models.

To ameliorate exposure bias, this thesis presents two solutions that make training more aware of the behavior of the downstream decoding algorithm, so that credit is properly assigned when the model digresses from the reference sequence during gradient-based training. Both solutions crucially rely on continuous relaxations of the discontinuous decoding procedures commonly used with neural sequence models, including greedy arg-max decoding, ancestral sampling, and beam search, which enables gradient-based optimization with automatic differentiation libraries. These approaches empirically outperform standard approaches on natural language processing tasks such as machine translation, CCG supertagging, and named entity recognition.
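As a rough illustration of what a continuous relaxation of greedy arg-max decoding can look like, the sketch below replaces the hard arg-max over token scores with a temperature-controlled softmax and feeds forward a weighted average of token embeddings. This is a generic technique chosen for illustration, not necessarily the exact formulation used in the thesis; the function names, the `temperature` parameter, and the toy scores and embeddings are all assumptions.

```python
import numpy as np

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a 1-D score vector."""
    z = np.asarray(scores, dtype=float) / temperature
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def soft_argmax_embedding(scores, embeddings, temperature=1.0):
    """Relaxed arg-max (illustrative): a softmax-weighted average of embeddings.

    As temperature approaches 0, the weights concentrate on the
    highest-scoring token and the result approaches the hard arg-max
    embedding, while remaining differentiable in `scores` for any
    temperature > 0.
    """
    weights = softmax(scores, temperature)
    return weights @ embeddings

# Toy example (hypothetical values): three tokens with one-hot "embeddings".
scores = np.array([2.0, 1.0, 0.5])
embeddings = np.eye(3)

hard = embeddings[np.argmax(scores)]  # discontinuous choice
soft = soft_argmax_embedding(scores, embeddings, temperature=0.05)  # relaxed choice
```

With a low temperature the relaxed output is numerically close to the hard choice, which is what lets a training objective approximate the decoder's behavior while still passing gradients through the selection step.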

Next, this dissertation focuses on a different class of probabilistic sequence models, globally normalized models, which accommodates more flexible generation procedures and is unlikely to suffer from exposure bias and label bias, but involves tradeoffs in computational complexity. A method for training globally normalized sequence models is introduced, which modifies the search-aware algorithm described above that is based on the continuous relaxation of beam search. An empirical comparison of these globally normalized models with their locally normalized counterparts, also trained via the continuous relaxation of beam search, reveals that the globally normalized training strategy yields models that respond more effectively to search errors during training.
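The distinction between the two model classes can be written down compactly. The notation here is an assumption introduced for illustration: $s_\theta$ is a per-step scoring function, $V$ is the vocabulary, and $\mathcal{Y}$ is the set of all candidate output sequences.

```latex
% Locally normalized: each step is normalized over the vocabulary V.
p_{\text{local}}(y \mid x)
  = \prod_{t=1}^{T}
    \frac{\exp s_\theta(y_t, y_{<t}, x)}
         {\sum_{v \in V} \exp s_\theta(v, y_{<t}, x)}

% Globally normalized: one normalizer over all candidate sequences.
p_{\text{global}}(y \mid x)
  = \frac{\exp \sum_{t=1}^{T} s_\theta(y_t, y_{<t}, x)}
         {\sum_{y' \in \mathcal{Y}} \exp \sum_{t=1}^{|y'|} s_\theta(y'_t, y'_{<t}, x)}
```

The sum over $\mathcal{Y}$ in the global normalizer is in general intractable to compute exactly, which is one source of the computational tradeoff mentioned above.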

Following this promising behavior of globally normalized models, this thesis explores the energy-based modeling view of fully connected globally normalized models and proposes powerful bidirectional energy parametrizations for sequences. Specifically, it interprets the optimization of popular masked language models (MLMs) as implicit training of energy-based sequence models and introduces a strategy for sampling correctly from MLMs, which do not have a probabilistic interpretation on their own. Beyond providing this sampling strategy, the work offers evidence that energy-based sequence models can be trained efficiently in an indirect manner.

To conclude, while autoregressive models are easy to train and efficient to use, they are saddled with poor inductive biases and exhibit degenerate behavior. The alternative class of globally normalized models is limited by its computational complexity, but it accommodates more flexible and powerful sequence models.