In BERT, what are Token Embeddings, Segment Embeddings and Position Embeddings?

Chris Staff asked 11 months ago
2 Answers
Best Answer
Chris Staff answered 11 months ago

BERT has a unique way of processing tokenized inputs. As with any language model, input phrases are first tokenized, and each individual token is appended to the sequence of tokens to be processed.
 
However, BERT's architecture imposes two extra requirements on the input. Self-attention processes all tokens in parallel, so no sequential (position) information is natively present in the tokens, and the next sentence prediction (NSP) pretraining task requires two sentences to be present in the input. The tokens must therefore be preprocessed to add this information.
 
In the example below, you see two input sentences (“All ok?” and “Yes”) structured according to the BERT input format, with the [CLS] token at the start and a [SEP] token after each sentence.
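As an illustration, here is a minimal sketch of how those two sentences get packed into a single sequence (this assumes the HuggingFace transformers library and the bert-base-uncased tokenizer; it is not part of the original answer):

```python
from transformers import BertTokenizer

# Load the WordPiece tokenizer used by BERT (assumes bert-base-uncased).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Encode sentence A ("All ok?") and sentence B ("Yes") as one input pair.
encoding = tokenizer("All ok?", "Yes")

# Tokens come out as: [CLS] all ok ? [SEP] yes [SEP]
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))

# token_type_ids marks sentence membership: 0 for sentence A, 1 for sentence B.
print(encoding["token_type_ids"])  # [0, 0, 0, 0, 0, 1, 1]
```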
 
First of all, each token is fed through the embedding layer, which yields a token embedding. This token embedding, while already an informative representation of the token, carries no position information. That is added by means of a position embedding, similar in purpose to the positional encoding in the vanilla Transformer by Vaswani et al. (although BERT learns its position embeddings rather than using fixed sinusoidal encodings).
 
Then, finally, we also must know whether a particular token belongs to sentence A or sentence B in BERT. We achieve this with another embedding, called the segment embedding: one embedding shared by all tokens of sentence A and another shared by all tokens of sentence B.
 
Preprocessing the input for BERT before it is fed into the encoder segment thus boils down to taking the token embedding, the segment embedding and the position embedding and summing them elementwise. What is fed into BERT then contains information about the token itself, about its position in the phrase, and about whether it belongs to sentence A or sentence B.
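As a rough sketch of that sum (a hypothetical, minimal PyTorch example; the sizes match bert-base but this is not BERT's actual implementation, and the input ids are illustrative):

```python
import torch
import torch.nn as nn

# bert-base sizes: vocab_size=30522, hidden_size=768, max sequence length=512, 2 segments.
vocab_size, hidden_size, max_len, num_segments = 30522, 768, 512, 2

token_emb = nn.Embedding(vocab_size, hidden_size)      # one vector per WordPiece token
position_emb = nn.Embedding(max_len, hidden_size)      # one vector per position (learned)
segment_emb = nn.Embedding(num_segments, hidden_size)  # one vector for sentence A, one for B

# The 7 ids from the "[CLS] all ok ? [SEP] yes [SEP]" example above (values illustrative).
input_ids = torch.tensor([[101, 2035, 7929, 1029, 102, 2748, 102]])
token_type_ids = torch.tensor([[0, 0, 0, 0, 0, 1, 1]])
position_ids = torch.arange(input_ids.size(1)).unsqueeze(0)

# BERT's input representation: elementwise sum of the three embeddings.
embeddings = token_emb(input_ids) + position_emb(position_ids) + segment_emb(token_type_ids)
print(embeddings.shape)  # torch.Size([1, 7, 768])
```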
 
Source:
 
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
 

lawsonabs answered 3 months ago

The segment embedding is a learned embedding.
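Indeed; this can be checked in the HuggingFace implementation, where the segment embedding is exposed as a trainable lookup table (a minimal sketch, assuming the transformers library and bert-base-uncased; token_type_embeddings is that implementation's name for the segment embedding):

```python
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# The segment embedding is a trainable lookup table with 2 rows
# (one row per segment: sentence A and sentence B).
seg = model.embeddings.token_type_embeddings
print(seg)                       # Embedding(2, 768)
print(seg.weight.requires_grad)  # True: learned during pretraining/fine-tuning
```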
