Carnegie Mellon University
tachungc_phd_lti_2024.pdf (3.29 MB)

Toward Length-Extrapolatable Transformers

posted on 2024-06-26, 19:16, authored by Ta-Chung Chi

Since the advent of Transformer language models, the field of natural language processing has seen remarkable progress. Unfortunately, the cost of training such models grows quadratically with sequence length, making it difficult for practitioners with limited GPU resources to adopt long-sequence pre-training. One way to address this limitation is to allow the model to handle much longer sequences at test time without further parameter updates. This capability, known as length extrapolation, is nontrivial and presents several challenges.
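The quadratic cost mentioned above comes from the attention score matrix: every position attends to every other position, so an input of length n produces an n-by-n matrix of logits. A minimal sketch (not from the thesis; names and values are illustrative):

```python
import numpy as np

def attention_scores(q, k):
    """Scaled dot-product attention logits.

    For a sequence of length n, the result is an (n, n) matrix,
    which is the source of the quadratic compute/memory cost.
    """
    d = q.shape[-1]
    return q @ k.T / np.sqrt(d)

n, d = 8, 4
rng = np.random.default_rng(0)
q = rng.normal(size=(n, d))
k = rng.normal(size=(n, d))
scores = attention_scores(q, k)
assert scores.shape == (n, n)  # grows as n**2 with sequence length
```

Doubling the sequence length quadruples the size of this matrix, which is why pre-training directly on long sequences is expensive.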

First, classic Transformer language models rely on per-position positional embeddings to provide positional information; this becomes problematic when unseen positions are encountered during the extrapolation stage. Second, models pre-trained on short sequences struggle when directly fed longer sequences due to the length distributional shift problem; maintaining stable perplexity on longer sequences has proven challenging with existing approaches. Finally, evaluation of length extrapolation capability often relies solely on natural language perplexity; this may not tell the whole story, since natural language is highly localized, unlike regular languages and downstream tasks such as long-context QA and code completion.

This thesis addresses the aforementioned challenges from three perspectives.

Part I investigates the role of positional embeddings in Transformer language models. The thesis demonstrates that strong positional signals are still encoded in the hidden states of a Transformer language model even without explicit positional embeddings. Building on this observation, it introduces a new variant of relative positional embedding named KERPLE, derived from conditionally positive definite kernels.

Part II presents a thorough analysis of existing length-extrapolatable Transformers by measuring the width of their receptive fields. The key to successful length extrapolation on language modeling tasks is found to be the alignment of training and testing receptive fields. This insight leads to a new relative positional embedding design named Sandwich, which builds upon the originally proposed Sinusoidal positional embedding.

Part III examines the Transformer's length extrapolation beyond language modeling and perplexity measurement. Motivated by recently proposed long-context retrieval tasks, the thesis provides a better understanding of the attention mechanism and advances the Transformer's implicit retrieval capability through data-dependent adjustment of the Softmax temperature. In addition, it addresses the Transformer's failure on formal language extrapolation tasks. Ideas from previous work, such as Weight-Sharing, Adaptive-Depth, and Sliding-Window-Attention mechanisms, collectively inspire a new Transformer variant named RegularGPT, which demonstrates extrapolation capability on regular languages.
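To make the kernel idea in Part I concrete, the logarithmic KERPLE variant biases attention logits with a kernel of the relative distance, bias[m, j] = -r1 * log(1 + r2 * |m - j|), where r1, r2 > 0 are learnable per-head scalars. A minimal sketch (parameters fixed to illustrative values rather than learned):

```python
import numpy as np

def kerple_log_bias(n, r1=1.0, r2=1.0):
    """Relative positional bias from the logarithmic KERPLE kernel.

    bias[m, j] = -r1 * log(1 + r2 * |m - j|), added to attention
    logits before the softmax. r1, r2 > 0 are learnable per-head
    scalars in the actual model; here they are fixed constants.
    """
    pos = np.arange(n)
    dist = np.abs(pos[:, None] - pos[None, :])  # relative distances
    return -r1 * np.log1p(r2 * dist)

bias = kerple_log_bias(6)
# the bias depends only on |m - j|, so the same formula applies
# unchanged to positions beyond the training length
assert bias[0, 0] == 0.0
```

Because the bias is a function of relative distance only, it is defined for arbitrarily long sequences, which is what makes kernel-based relative embeddings a natural fit for length extrapolation.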

This thesis concludes its exploration of length-extrapolatable Transformers by suggesting several concrete future directions that pave the way for further research on Transformer length extrapolation.




Degree Type

  • Dissertation


Department

  • Language Technologies Institute

Degree Name

  • Doctor of Philosophy (PhD)


Advisor(s)

  • Alexander I. Rudnicky
