The Bidirectional and Auto-Regressive Transformer, or BART, is a Transformer that combines a bidirectional encoder (i.e. BERT-like) with an autoregressive decoder (i.e. GPT-like) into one Seq2Seq model. In other words, it returns to the original Transformer architecture proposed by Vaswani et al., albeit with a few changes.
Let’s take a look at it in a bit more detail.
Recall BERT, where the model is bidirectional and the task is to predict masked tokens. Also recall GPT, where the model is autoregressive and the task is to predict the next token:
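This next-token objective is easy to sketch in plain Python. The snippet below is a toy illustration (whitespace tokens, a made-up example sentence), not the actual GPT training pipeline: every position is trained to predict the token that follows it.

```python
# Toy illustration of the autoregressive (next-token) objective:
# each input position is trained to predict the following token.
tokens = ["The", "cat", "sat", "on", "the", "mat"]

# Inputs are all tokens except the last; targets are shifted left by one.
inputs = tokens[:-1]
targets = tokens[1:]

for inp, tgt in zip(inputs, targets):
    print(f"given ...{inp!r} -> predict {tgt!r}")
```

Shifting the sequence by one position is all that is needed to turn raw text into (input, target) pairs for this objective.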
BART combines the two approaches into one:
- It is a standard Vaswani et al. based Transformer except for a few changes. First, GELU activations are used in the feedforward segments instead of ReLU activations. In addition, the parameters are initialized from a N(0, 0.02) distribution. The base BART model uses 6 encoder and 6 decoder segments; the large BART model uses 12 of each.
- Unlike other Transformers, BART accepts any type of noising / text corruption for training the model:
- It allows Token Masking, where certain tokens are replaced with <mask> tokens, as in BERT.
- It allows Token Deletion, where certain tokens are deleted. Here, the model must learn which inputs are missing, which is a more difficult task.
- Text Infilling is also allowed; here, a span of multiple tokens is replaced with a single <mask> token, so the model must also predict how many tokens are missing.
- Sentence permutation, where the document is split into sentences based on full stop tokens (.), which are then shuffled in random order.
- Document rotation, where a random token is selected and the document is rotated so that it starts with that token; the model must then learn to identify the true start of the document.
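The corruption schemes above can be sketched in a few lines of plain Python. This is an illustrative toy version (whitespace tokens, invented function names, a fixed seed); the paper's actual implementation works on subword tokens and samples infilling span lengths from a Poisson(lambda=3) distribution.

```python
import random

random.seed(0)
MASK = "<mask>"

def token_masking(tokens, p=0.3):
    # Replace each token with <mask> with probability p (BERT-style).
    return [MASK if random.random() < p else t for t in tokens]

def token_deletion(tokens, p=0.3):
    # Delete tokens outright; the model must infer *where* input is missing.
    return [t for t in tokens if random.random() >= p]

def text_infilling(tokens, start, length):
    # Replace a whole span with a single <mask> token; the model must
    # also predict how many tokens the span contained.
    return tokens[:start] + [MASK] + tokens[start + length:]

def sentence_permutation(text):
    # Split on full stops and shuffle the resulting sentences.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    random.shuffle(sentences)
    return ". ".join(sentences) + "."

def document_rotation(tokens):
    # Pick a random token and rotate the document to start there.
    i = random.randrange(len(tokens))
    return tokens[i:] + tokens[:i]

doc = "the cat sat on the mat".split()
print(text_infilling(doc, start=1, length=3))  # ['the', '<mask>', 'the', 'mat']
```

Note that deletion is harder than masking: a <mask> token marks the position of missing content, whereas a deleted token leaves no trace.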
- These noising schemes allow a variety of language modeling objectives to be evaluated:
- Language Model objective, where we simply use BART as a Left-to-Right Transformer for language generation.
- Permuted Language Model, where, as in XLNet, 1/6 of the tokens are sampled and then generated in a random order.
- Masked Language Model, where <mask>-ed tokens are used and where their true contents must be predicted.
- Multitask Masked Language Model, where, as in UniLM, additional masks are also applied in the self-attention segments, making the task more complex.
- Masked Seq2Seq, where, as in MASS, a span containing 50% of the tokens is masked and a Seq2Seq model is trained to predict the masked tokens.
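Whatever the corruption scheme, the training signal has the same seq2seq shape: the encoder receives the corrupted text and the decoder is trained to reproduce the uncorrupted original. A toy sketch (hypothetical function name, whitespace tokens) of how such a training pair is built:

```python
# Toy sketch of a denoising seq2seq training pair: the encoder input is
# the corrupted document, the decoder target is the uncorrupted original.
MASK = "<mask>"

def make_training_pair(tokens, span_start, span_len):
    # Masked-Seq2Seq-style corruption: mask one contiguous span
    # (in MASS, the span contains roughly 50% of the tokens).
    corrupted = tokens[:span_start] + [MASK] + tokens[span_start + span_len:]
    return corrupted, tokens  # (encoder input, decoder target)

original = "bart reconstructs the original text from noise".split()
src, tgt = make_training_pair(original, span_start=2, span_len=3)
print(src)  # ['bart', 'reconstructs', '<mask>', 'from', 'noise']
print(tgt)  # the full, uncorrupted token list
```

Because the target is always the full original text, the same training loop works for any of the noising schemes listed above; only the corruption function changes.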
Finetuning BART can be performed with a variety of downstream applications:
- Sequence classification
- Token classification
- Sequence generation
- Machine translation
Various tasks can be used for this. BART was tested with:
- SQuAD: question answering.
- MNLI: bitext classification for textual entailment.
- ELI5: long-form question answering.
- XSUM: news summarization.
- ConvAI2: dialogue responses.
- CNN/DM: news summarization.
Results suggest that:
- Performance of the pretraining method varies across tasks. Language modeling works well on ELI5 but poorly on SQuAD, to give just one example.
- Token masking is crucial, whether it is achieved through simple masking, self-attention masking or deletion. Rotation and permutation work less well.
- Text generation tasks benefit from left-to-right (LTR) pretraining.
- Question answering tasks seem to need bidirectionality instead of simple left-to-right modeling. Apparently, the more masking is used during pretraining (rather than pure left-to-right generation), the better the model performs on question answering.
- Pretraining objectives are not the only factor driving the results of a language model (i.e. swapping the language task does not have to hurt performance). Architectural choices seem to drive performance too.
- At the time the paper was written (2019), BART achieved the most consistent performance across the range of tasks.
Benefits of BART compared to BERT/GPT:
- BART can be used with an arbitrary noising scheme.
- Because it is a Seq2Seq model, it learns to reconstruct the original text rather than only the masked tokens.
Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., … & Zettlemoyer, L. (2019). BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.