The paradigm of pretraining a model on unsupervised data and subsequently fine-tuning it on labeled data has proven very successful across many NLP tasks. Many models follow the BERT architecture and thus depend, through the original Vaswani et al. Transformer architecture, on multi-headed self-attention.
1. Multi-headed self-attention is global in theory. That is, the attention blocks should capture global patterns from text. In practice, however, these blocks are often found to learn mostly local patterns. This is problematic, especially given that these segments are expensive in terms of required compute.
2. Removing some attention heads during finetuning does not degrade performance.
In other words, they argue that there is a heavy computational redundancy involved with training BERT-like models.
As a result, the authors introduce Convolutional BERT (ConvBERT). It improves the original BERT by replacing some multi-headed self-attention segments with cheaper and naturally local operations, so-called span-based dynamic convolutions. These are integrated into the self-attention mechanism to form a mixed attention mechanism: multi-headed self-attention keeps capturing global patterns, while the convolutions focus on the local patterns that the attention heads would otherwise learn redundantly. In other words, they reduce the computational cost of training BERT.
Why is ConvBERT necessary?
– If you look at attention maps, you see a diagonal pattern. This indicates that a word attends mostly to itself and to a few neighboring words. This suggests that attention is primarily LOCAL, despite the usage of a GLOBAL mechanism, i.e. the multi-headed self-attention segments. We might thus not need the global mechanism everywhere. This has a significant impact on model efficiency, because the global mechanism has a cost that grows quadratically with the number of tokens.
– Adding more naturally local mechanisms can ensure that this quadratic complexity turns into linear complexity, which greatly benefits model training.
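The quadratic-versus-linear argument above can be made concrete with a back-of-the-envelope sketch (a hypothetical operation count, not a measurement; the sequence length, hidden size, and kernel size below are illustrative assumptions):

```python
# Rough operation counts: self-attention's score matrix costs O(n^2 * d),
# while a k-wide convolution costs O(n * k * d) -- linear in n.
def attention_ops(n, d):
    # QK^T score matrix: n x n entries, each a d-dimensional dot product
    return n * n * d

def conv_ops(n, d, k):
    # one k-wide kernel applied at each of n positions over d channels
    return n * k * d

n, d, k = 512, 256, 9
ratio = attention_ops(n, d) / conv_ops(n, d, k)
print(round(ratio, 1))  # 56.9 -- attention needs ~57x more operations here
```

Doubling the sequence length quadruples the attention cost but only doubles the convolution cost, which is why replacing some heads with local operations pays off on longer inputs.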
The architecture of ConvBERT and architectural changes compared to BERT look as follows:
– Span-based dynamic convolutions: first of all, a span-based dynamic convolutional layer is used together with multi-head attention (in fact, it replaces some attention heads). Contrary to classic convolutional layers, which apply the same fixed kernels at every position of the input, dynamic convolutions generate one kernel per input token. However, this means that a given token (e.g. a word) always produces the same kernel regardless of its surroundings, ignoring contextual dependencies between words in phrases. This is why spans are fed to the dynamic convolutions instead, which introduces context dependence. This is why it is called a span-based dynamic convolutional layer.
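A minimal sketch of the idea, assuming a simplified kernel generator (the paper derives the kernel from queries and span-aware keys; here the span's mean representation stands in for that, and all names such as `span_dynamic_conv` and `W_gen` are illustrative, not the authors' implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def span_dynamic_conv(X, W_gen, k=3):
    """Toy span-based dynamic convolution.
    X: (n, d) token representations; W_gen: (d, k) kernel generator."""
    n, d = X.shape
    pad = k // 2
    Xp = np.pad(X, ((pad, pad), (0, 0)))   # zero-pad sequence ends
    out = np.zeros_like(X)
    for i in range(n):
        span = Xp[i:i + k]                 # local span around token i
        # generate a position- and context-specific kernel from the span
        kernel = softmax(span.mean(axis=0) @ W_gen)   # shape (k,)
        out[i] = kernel @ span             # weighted sum over the span
    return out
```

Because the kernel is computed from the whole span rather than a single token, the same word gets different kernels in different contexts, which is exactly what plain dynamic convolutions lack.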
– Bottleneck structure: second of all, the number of attention heads is reduced by introducing a bottleneck structure in the embedding part of the BERT architecture. Input tokens are embedded into a lower-dimensional space before they are fed to the mixed-attention segment. This relieves computational redundancy in both the embeddings and the downstream attention segments. A reduction parameter γ is introduced for this; γ is always > 1. It shrinks the dimensionality of the Q, K, V mappings of the attention segment AND the number of attention heads, which is reduced accordingly.
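The effect of the bottleneck can be sketched as follows (a hypothetical minimal version; the projection matrix `W_down` and the specific dimensions are illustrative assumptions, with γ = 2 as in the base configuration described later):

```python
import numpy as np

d_model, gamma, n_tokens, n_heads = 768, 2, 128, 12
d_small = d_model // gamma                  # Q/K/V operate in this space

rng = np.random.default_rng(0)
X = rng.normal(size=(n_tokens, d_model))    # input embeddings
W_down = rng.normal(size=(d_model, d_small)) * 0.02
H = X @ W_down                              # bottlenecked representations
heads = n_heads // gamma                    # head count shrinks with gamma

print(H.shape, heads)  # (128, 384) 6
```

Every downstream attention computation now works on 384-dimensional instead of 768-dimensional vectors, so both the Q/K/V projections and the score computation get cheaper.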
– Grouped linear operator: the fully-connected feedforward network in the BERT architecture is responsible for the majority of the parameters. The inner layer of the feedforward segment has a 4x higher dimensionality than its input and output. ConvBERT introduces a 'grouped linear operator', which splits the input into multiple groups and processes them independently. The results are then concatenated.
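A minimal sketch of the grouped linear operator (the function name and sizes are illustrative assumptions; the point is the parameter saving from processing feature groups independently):

```python
import numpy as np

def grouped_linear(X, weights):
    """Split the feature dimension into len(weights) groups, apply an
    independent linear map to each group, and concatenate the results."""
    g = len(weights)
    chunks = np.split(X, g, axis=-1)        # g groups along the features
    return np.concatenate([c @ W for c, W in zip(chunks, weights)], axis=-1)

# d=8 input features in g=2 groups of 4, each mapped 4 -> 16 (4x inner dim)
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))
Ws = [rng.normal(size=(4, 16)) for _ in range(2)]
Y = grouped_linear(X, Ws)
print(Y.shape)  # (3, 32)
# parameters: 2 * 4 * 16 = 128, versus 8 * 32 = 256 for a full linear layer
```

With g groups, the weight count of the layer drops by a factor of g, at the cost of no interaction between groups within this layer.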
Span-based dynamic convolutions are grouped with classic multi-headed self-attention segments in a concatenation operation:
Concat(Self-attention(Q, K, V); SDConv(Q, Ks, V; Wt)).
Here, Q, K and V are the queries, keys and values generated for self-attention. As you can see, the queries and values are also used within the SDConv segment, which generates its own span-aware keys Ks and in addition has weights Wt that are responsible for producing the kernel given the input span.
In other words, ConvBERT is built by stacking encoding segments, just like in BERT – but then with mixed attention, a bottleneck structure regarding input embeddings, and a grouped linear operator to make the feedforward segment in the encoding segment more efficient. The authors conjecture that this significantly speeds up learning and thus might allow them to obtain state-of-the-art (SOTA) results.
ConvBERT Experimental Setup:
What language tasks, datasets and experiments does ConvBERT use to evaluate the model and the contributions of its individual segments? Which ablation studies are performed?
– Pretraining is performed with the 'replaced token detection' task.
– ConvBERT is evaluated with a variety of model sizes.
* Small: hidden dimensionality 256, word embedding dimensionality 128, feedforward module dimensionality 1024 (4x hidden dimensionality), 12 layers, 4 attention heads.
* Medium: hidden dimensionality 384, 8 attention heads.
* Base: common BERT config, hidden dimensionality 768, 12 layers, 12 attention heads, γ = 2 meaning a reduction in heads.
– Batch size is 128 for ConvBERTsmall and 256 for ConvBERTmedium and ConvBERTbase.
– Input sequences are 128 tokens long.
– Evaluation happens on the GLUE benchmark (with a variety of language tasks) and the SQuAD dataset (for question answering). This is performed with a configuration similar to the ELECTRA model for fair comparison. Comparisons are also made with knowledge-distillation-based models like TinyBERT, MobileBERT and DistilBERT. The model is finetuned per task.
– As an ablation study, the work starts with the original BERT architecture and then applies and investigates (combinations of) the bottleneck structure, span-based dynamic convolutions and the grouped linear operator, to see how much each change contributes to the overall results. In addition, the hidden dimensionality is increased to show the performance gains at higher dimensionality.
ConvBERT Results / Performance:
How well does ConvBERT perform? How does this compare to other Transformer models?
– ConvBERT achieves a score of 86.4 on GLUE => 5.5 higher than BERTbase and 0.7 higher than ELECTRAbase, while requiring less training cost and fewer parameters. In other words, ConvBERTsmall performs on par with or better than other small-sized models with more representation power. ConvBERTsmall and ConvBERTmedium also outperform the baseline ELECTRAsmall and are comparable to BERTbase. MobileBERT and similar models score much better, but this is likely because they were obtained through knowledge distillation and architecture search; ConvBERTbase in turn outperforms them when training cost is taken into account.
– The bottleneck structure in input embeddings can reduce the number of parameters (and thus computational cost) without hurting performance. It can even be beneficial in a small setting (e.g. with ConvBERTsmall), possibly because attention heads are then required to learn more compact representations.
– Playing with the span-based dynamic convolutions (e.g. with the kernel size) suggests that larger kernels perform better IF their receptive fields do not cover the whole input sequence. Once they do, a context issue occurs: the supposedly local kernels effectively see global context.
– Adding spans to the dynamic convolutions clearly improves performance; without them, the dynamic convolutions do not improve the results much.
Jiang, Z., Yu, W., Zhou, D., Chen, Y., Feng, J., & Yan, S. (2020). ConvBERT: Improving BERT with span-based dynamic convolution. arXiv preprint arXiv:2008.02496.