What is the Longformer Transformer and how does it work?

Chris Staff asked 10 months ago
Chris Staff answered 10 months ago

Transformers have really changed the NLP world, in part due to their self-attention component. But this component is problematic in the sense that its computational and memory requirements grow quadratically with sequence length, due to the QK^T product (Queries multiplied with the transposed Keys) in the self-attention component. As a consequence, Transformers cannot be trained on really long sequences because resource requirements are simply too high. BERT, for example, sets a maximum sequence length of 512 tokens.
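To see where the quadratic growth comes from, here’s a minimal NumPy sketch of the attention score computation: the QK^T product produces an n × n matrix, so doubling the sequence length quadruples memory use. This is an illustration only, not how any real Transformer library implements it.

```python
import numpy as np

def full_attention_scores(Q, K):
    """Naive scaled dot-product attention scores.

    Returns an (n, n) matrix, so memory grows quadratically
    with the sequence length n."""
    d = Q.shape[-1]
    return Q @ K.T / np.sqrt(d)

n, d = 512, 64                       # sequence length, head dimension
Q = np.random.randn(n, d)
K = np.random.randn(n, d)
scores = full_attention_scores(Q, K)
print(scores.shape)                  # (512, 512): n^2 entries
```

Doubling n to 1024 yields a (1024, 1024) score matrix: four times as many entries, which is exactly the scaling problem Longformer attacks.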

Longformer has set a step into a new direction. It is a modified Transformer architecture that employs a linearly scalable self-attention mechanism, meaning that memory and computational growth happens linearly with sequence length instead of quadratically. It can be used for the same variety of language tasks, e.g. long document classification and question answering. In doing so, it (1) avoids the need for shortening the input sequences, which incurs information loss; and (2) it removes the need for task-specific architectures to make such shortening possible.

Let’s take a closer look at the Longformer paper and summarize it here. Before we start, however, let’s make sure to include the contributions of the Longformer architecture:
1. The introduction of an adapted attention mechanism which uses a windowed local-context self-attention and an end-task motivated global attention. It allows Longformer to understand local context while having a grasp about global information as well. We’ll dive into that in more detail below.
2. Applying this new mechanism in other document-level tasks, even in existing pretrained Transformers, extending previous work that primarily focused on autoregressive language modeling (LM). After showing that it works with LM, the authors pretrain with masked language modeling (MLM) and subsequent finetuning on a variety of downstream tasks. This also works.
3. Showing that the attention mechanism can be applied to ‘classic’ Transformers as well, i.e. having it work with an encoder-decoder Seq2Seq architecture instead of with a BERT-like encoder only. They call this Longformer-Encoder-Decoder, or LED.
Rationale for Longformer: previous work
There are two streams of thought that provide the rationale for Longformer: (1) the fact that Long-Document Transformers are still somewhat of a dream, and (2) the need to avoid task-specific architectures that often emerge with non-Long Document Transformers in order to be able to process long documents.

(About Long-Document Transformers)
-> In previous work on Long-Document Transformers, two types of self-attention have been investigated: left-to-right (ltr) attention processing the document in chunks, and a “sparse attention pattern” based approach that avoids quadratic computations for the attention matrix.
-> The left-to-right approach has shown some success with Autoregressive Language Modeling (LM; i.e. GPT-like) tasks, but such models cannot be used in other settings, such as transfer learning in a pretraining-finetuning setting, if the task benefits from bidirectionality.
-> The sparse attention pattern based approach has been explored as well. Generally, dilated sliding windows are applied in this context, sliding over the input embedding when computing attention, significantly reducing computational needs. As we shall see, Longformer extends this local attention with a global attention mechanism from which global context-based tasks (e.g. sentiment analysis) can benefit. Since sparse attention was mostly applied to LM before, applying it to different problems in Longformer is a step forward.
(Avoiding Task-Specific Architectures)
-> BERT shortens input sequences to 512 tokens because it cannot process more of them. This is just one example: subsequently, many task-specific approaches have built task-specific architectures to handle input text.
-> Examples are truncating documents, chunking documents into pieces of length 512 and then combining the activations for each chunk with a task-specific model, or a multihop approach where stage one retrieves relevant documents and stage two processes them.
-> The problem with these approaches is that information is lost. The idea behind Longformer is that no task-specific truncation whatsoever is necessary anymore, and that long documents can be input and processed, avoiding the information loss.
(Contemporary works like Longformer)
-> There are some works that have also explored sparse attention mechanisms like Longformer’s, which we will take a look at shortly. Let’s however mention these other architectures first.
-> ETC (Ainslie et al., 2020) use local + global attention but change the architecture somewhat.
-> GMAT (Gupta and Berant, 2020) use the idea of global memory.
-> BigBird (Zaheer et al., 2020) extends ETC, evaluating it on additional tasks and showing that sparse Transformers are universal approximators, i.e. that they can in theory handle all tasks that can be solved by classic Transformers.
-> The original Transformer has a self-attention component: we all know that. Its time and memory complexity is O(n^2), where n is the input sequence length. This is a challenge with long sequences, because a lot of memory and hence time is needed, rendering training such models almost impossible.
-> Longformer sparsifies the full self-attention matrix with an attention pattern that scales linearly with input sequence length, adding efficiency. We’ll now take a look at the pattern and its implementation.
(Attention Pattern)
-> Sliding windows: Longformer employs a fixed-size windowed attention surrounding each token. This means that each token no longer attends to all other tokens, but only to those within its window. Stacking many layers of such sliding windows on top of each other (which is common with any Transformer) still leads to a big receptive field, so no text goes unseen. Computational complexity is O(n x w) when sliding windows are employed, where w is the window size. This is significantly more efficient than the original Transformer’s O(n^2), as w << n.
-> Dilated sliding windows: to further increase receptive field without increasing computation, Longformer employs “dilation” where each window has gaps of fixed size d. Different dilation configurations are employed per attention head, improving performance.
-> Global Attention: finetuning on a specific task requires that input is specified differently; this becomes clear from all the task-specific architectures. Input specification also impacts attention in classic Transformers, meaning that classic Transformers generally attend more to the parts important to the task outcome. Dilated sliding windows don’t offer this, but rather provide a fixed means of sparse attention. That’s why Longformer also adds “task-oriented global attention”, a symmetric a priori addition of attention to specific tokens. For example, classification (e.g. sentiment analysis) requires global attention to be applied on the CLS token, whereas question answering requires applying global attention to all question tokens.
-> Two sets of Queries, Keys and Values are used: one for dilated sliding window based attention and one for global attention, providing flexibility to model the different types of attention.
-> Implementing the dilated sliding windows requires a type of implementation not readily available in libraries like PyTorch or TensorFlow. That’s why the authors have developed the mechanism themselves in three different ways:
-> The ‘loop’ way is a memory efficient PyTorch implementation, but is slow.
-> The ‘chunks’ way only supports non-dilated sliding windows.
-> The ‘cuda’ way utilizes a custom CUDA kernel and is fully functional and highly optimized.
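To make the attention pattern above concrete, here’s a small sketch that builds the combined mask: a (dilated) sliding window per token, plus symmetric global attention on selected tokens (e.g. index 0 for a CLS-style token). Note this is illustrative only: a real implementation never materializes the full n × n matrix, which is precisely what makes Longformer linear in n.

```python
import numpy as np

def longformer_mask(n, w, dilation=1, global_idx=()):
    """Boolean (n, n) mask: True where attention is allowed.

    Combines a dilated sliding window of size w with symmetric
    global attention on the tokens in global_idx."""
    mask = np.zeros((n, n), dtype=bool)
    half = w // 2
    for i in range(n):
        for k in range(-half, half + 1):
            j = i + k * dilation          # gaps of size `dilation` in the window
            if 0 <= j < n:
                mask[i, j] = True
    for g in global_idx:
        mask[g, :] = True                 # global tokens attend everywhere...
        mask[:, g] = True                 # ...and every token attends to them
    return mask

m = longformer_mask(n=16, w=4, dilation=1, global_idx=[0])  # index 0 ~ CLS
print(m.sum(axis=1))  # attended positions per token: ~w+1, except global/edge tokens
```

Each interior row allows roughly w + 1 positions (its window plus the global token), so the total work is O(n x w) rather than O(n^2).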
Does it work? Experiments with (Autoregressive) Language Modeling
-> Recall that Autoregressive Language Modeling (LM) involves “estimating the probability distribution of an existing token/character given its previous token/characters in an input sequence”. In other words, it involves predicting the next token given previous tokens using some maximum likelihood. It is one of the fundamental tasks in NLP. Longformer is evaluated on a LM task to see if it can be competitive with current approaches.
-> Longformer for LM uses dilated sliding window attention with different window sizes across layers: small window sizes for lower layers and larger ones in higher layers. This way, top layers can learn higher-level representations while lower ones learn more fine-grained ones – as well as finding a balance between efficiency and performance. Dilation itself seems to be applied in higher layers only.
-> The training procedure for LM involves increasing the window size across training phases. This ensures that the model first learns local context, and only then learns global context towards the end. This is good, because learning global context is very costly. In other words, training is kept as fast as possible, keeping the slow part for the end.
-> The model is trained over 5 phases, with a starting sequence length of 2048, ending with 23040 in the last phase.
-> Evaluation happens with sequences of length 32256.
-> The model achieves state-of-the-art results on the text8 and enwik8 datasets.
-> Large Longformer models also outperform comparable Transformer approaches, such as Transformer-XL, while matching the performance of the comparable Sparse Transformer. While it underperforms some other approaches (Adaptive Span; Compressive Transformer) that have twice as many parameters, these can only be used in a LM setting – and the benefit of Longformer is that it can also be used with the pretraining-finetuning paradigm.
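The staged training procedure can be sketched as a simple schedule. The 2048 starting length and 23040 final length come from the paper; the doubling rule in between is an illustrative assumption here:

```python
def staged_schedule(phases=5, start_len=2048, final_len=23040):
    """Per-phase training sequence lengths: double each phase,
    with the last phase pinned to the reported final length.

    The doubling rule between the endpoints is an assumption."""
    lengths = []
    length = start_len
    for _ in range(phases):
        lengths.append(min(length, final_len))
        length *= 2
    lengths[-1] = final_len   # the paper reports 23040 for the final phase
    return lengths

print(staged_schedule())  # [2048, 4096, 8192, 16384, 23040]
```

The idea is simply that cheap, small-window phases come first, and the expensive long-context phase is saved for the end.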
An ablation study performed by training a variety of configurations for 150K steps shows that:
-> Increasing the window size from bottom to top layer leads to best performance.
-> The reverse order leads to worse performance.
-> Using a fixed window size gives a performance in between.
-> Adding dilation to two heads leads to some performance improvement compared to no dilation at all.

Does it work? Masked Language Modeling / Pretraining for Finetuning
-> Beyond Autoregressive Language Modeling, Longformer was also tested on Masked Language Modeling, in a Pretraining-Finetuning setting.
-> To do so, it was pretrained on a document corpus and finetuned for six tasks, including classification, question answering, and coreference resolution.
-> The resulting model can process sequences up to 4096 tokens long (8 times longer than BERT).
-> MLM was used for pretraining (as is common with BERT). Pretraining started from a previously released RoBERTa checkpoint, making only minimal changes necessary to support the attention mechanism, which can be plugged into any pretrained Transformer model without the need to change the architecture.
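As a refresher, MLM pretraining corrupts the input by masking tokens and asks the model to recover them. Here’s a minimal sketch; it is simplified in that the 80/10/10 replacement split used by BERT/RoBERTa is omitted:

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", p=0.15, seed=42):
    """BERT-style MLM corruption sketch: replace a fraction p of
    tokens with a mask token and record the prediction targets."""
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < p:
            targets[i] = tok          # the model must predict this token
            corrupted.append(mask_token)
        else:
            corrupted.append(tok)
    return corrupted, targets

corrupted, targets = mask_tokens(
    "the quick brown fox jumps over the lazy dog".split())
```

During pretraining, the model is then trained to predict the original tokens stored in `targets` from the corrupted sequence.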

(Attention Pattern)
-> Sliding window attention with a window size of 512, matching the computational requirements of RoBERTa (for comparison).

(Position Embeddings)
-> Extra position embeddings are added to support 4096 tokens. The extra ones are initialized by copying RoBERTa’s 512 position embeddings multiple times – very effective, as the results show.
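The copying trick is straightforward: tile the pretrained 512-position table until the new length is covered. A minimal NumPy sketch, with randomly initialized embeddings standing in for the actual RoBERTa weights:

```python
import numpy as np

def extend_position_embeddings(pe, new_len):
    """Initialize a longer position embedding table by tiling the
    pretrained one (e.g. RoBERTa's 512 positions) up to new_len."""
    old_len, dim = pe.shape
    reps = -(-new_len // old_len)            # ceiling division
    return np.tile(pe, (reps, 1))[:new_len]

pe512 = np.random.randn(512, 768)            # stand-in for RoBERTa's table
pe4096 = extend_position_embeddings(pe512, 4096)
```

Every 512-position slice of the new table starts out identical to the pretrained one, which gives the model a much better starting point than random initialization.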

(Continued MLM Pretraining)
-> Pretraining was performed on a self-constructed corpus.
-> Two models were trained: a base model and a large model.
-> Both trained for 65K gradient updates, sequence length 4096, batch size 64, max LR of 3e-5, linear warmup of 500 steps, followed by power 3 polynomial decay.
-> Results suggest that copying the position embeddings significantly improves the model compared to random initialization. In addition, 2K training steps from RoBERTa’s checkpoint already improve results, and 65K steps improve them further. This indicates that the model learns to better utilize the attention mechanism and thus the larger context.
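The learning rate schedule above (linear warmup of 500 steps to 3e-5, then power-3 polynomial decay over 65K updates) can be sketched as a small function. Decaying to exactly zero at the final step is an assumption here, as the paper does not state the end learning rate:

```python
def lr_at(step, max_lr=3e-5, warmup=500, total=65_000, power=3):
    """Linear warmup to max_lr, then polynomial (power-3) decay.

    Assumes the schedule decays to 0 at `total` steps."""
    if step < warmup:
        return max_lr * step / warmup
    frac = (total - step) / (total - warmup)   # 1 after warmup, 0 at the end
    return max_lr * frac ** power

print(lr_at(250), lr_at(500), lr_at(65_000))
```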

(Frozen RoBERTa Weights)
-> Results suggest that freezing the RoBERTa weights during pretraining leads to worse performance.

Finetuning on downstream language tasks
Longformer was applied to a variety of long document tasks: question answering, coreference resolution, classification. For comparison, a RoBERTa based model is chosen that takes the longest possible inputs.

(Question answering)
-> Fine-tuned on WikiHop, TriviaQA and HotpotQA.
-> WikiHop and TriviaQA training was performed BERT-style, concatenating questions and documents. Global attention was applied to question tokens and answer candidates for WikiHop, and to question tokens only for TriviaQA.
-> HotpotQA is a multihop QA task and hence a two-stage model was used. It first selects the most relevant paragraphs and then passes them to a second stage for answering. Preprocessing included concatenating question and context into one sequence.
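The QA input formatting described above can be sketched as follows: concatenate question and context BERT-style, and flag the question tokens for global attention. Treating the CLS and SEP tokens as global is an illustrative choice here, and the token names are placeholders:

```python
def build_qa_input(question_tokens, context_tokens,
                   cls="[CLS]", sep="[SEP]"):
    """BERT-style question+context concatenation with a per-token
    global-attention flag (1 = global) covering the question part."""
    tokens = [cls] + question_tokens + [sep] + context_tokens
    n_global = 1 + len(question_tokens) + 1   # CLS + question + SEP (assumed)
    global_mask = [1] * n_global + [0] * len(context_tokens)
    return tokens, global_mask

tokens, gmask = build_qa_input(["who", "wrote", "it", "?"],
                               ["beltagy", "et", "al", "wrote", "it"])
```

The mask would then be passed alongside the token ids so that question tokens attend to, and are attended by, the whole sequence, while context tokens use only the sliding window.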

(Coreference Resolution)
-> OntoNotes was used for testing, without global attention.

(Document Classification)
-> IMDB and Hyperpartisan news detection was used.
-> In IMDB, most documents are short, but 13.6% are longer than 512 wordpieces.
-> Hyperpartisan has relatively long documents, but is small in total size, and thus a good test for Longformer’s adaptability to limited data.
-> Global attention used on the <CLS> token.

-> Longformer outperforms the RoBERTa baseline in all cases, and especially in cases where long context is required.

Longformer-Encoder-Decoder (LED)
-> While the original Transformer was a Seq2Seq model with an encoder-decoder architecture, both BERT and GPT – the popular choices these days – use encoder or decoder only.
-> The Longformer-Encoder-Decoder (LED) variant of Longformer introduces a full encoder-decoder architecture again, for specific tasks like summarization, also to solve the issue with long inputs there.
-> It still uses the efficient local+global attention pattern, but on the encoder only, while the decoder utilizes the full self-attention mechanism.
-> It scales linearly with input.
-> Initialized from BART, but with position embeddings extended to 16K tokens, up from BART’s 1K tokens. Here too, they are initialized by copying over BART’s position embeddings.
-> Two variants were released: LED-base and LED-large, with 6 and 12 layers respectively in both the encoder and decoder stacks.
-> LED was evaluated on arXiv summarization data, which has texts with > 14.5K tokens. For the evaluation, it was trained using teacher forcing on gold training summaries and uses beam search at inference.
-> On arXiv, state-of-the-art results are observed. This is remarkable, especially given the fact that LED is not pretrained! Pretraining should improve results even further.
In other words, input length matters! The better models can support longer input, the better they can perform after training. Longformer was created for this purpose, using a pluggable sparse attention mechanism that combines dilated windowed attention for local context with full global attention on some tokens, of which the latter varies per task.
Beltagy, I., Peters, M. E., & Cohan, A. (2020). Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.
Zaheer, M., Guruganesh, G., Dubey, A., Ainslie, J., Alberti, C., Ontanon, S., … & Ahmed, A. (2020). Big bird: Transformers for longer sequences. arXiv preprint arXiv:2007.14062.
Gupta, A., & Berant, J. (2020). Gmat: Global memory augmentation for transformers. arXiv preprint arXiv:2006.03274.
Ainslie, J., Ontanon, S., Alberti, C., Pham, P., Ravula, A., & Sanghai, S. (2020). ETC: Encoding long and structured data in transformers. arXiv preprint arXiv:2004.08483.
