How does BERT separate token- and phrase-level tasks?

Chris Staff asked 11 months ago
1 Answer
Best Answer
Chris Staff answered 11 months ago

BERT prepends a class token, “[CLS]”, to every input sequence, which is converted into a class output token, “C”, during the forward pass.
 
This class output token captures phrase-level information because it interacts with all the other tokens through the self-attention mechanism in BERT’s encoder segments.
 
Of course, the individual output tokens capture token-level information.
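 
As a concrete illustration (using the Hugging Face Transformers library, which is my own assumption and just one way to inspect this, not something mentioned in the paper), a forward pass of a pretrained BERT model exposes both kinds of representations: the per-token outputs and the “C” vector at the [CLS] position:
 
```python
# Minimal sketch, assuming Hugging Face Transformers and the
# "bert-base-uncased" checkpoint; any BERT implementation exposes
# the same two kinds of outputs.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("BERT handles token- and phrase-level tasks.", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Token-level information: one 768-dimensional vector per input token.
token_outputs = outputs.last_hidden_state        # shape: (1, sequence_length, 768)

# Phrase-level information: the output at the [CLS] position, i.e. "C".
c_output = outputs.last_hidden_state[:, 0, :]    # shape: (1, 768)

print(token_outputs.shape, c_output.shape)
```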
 
When fine-tuning, you train the model to perform well on a particular task. This can be either a token-level or a phrase-level task.
 

  • If you want to train on a token-level task, you can use the BERT architecture and work with all the individual token outputs during fine-tuning.
  • If you want to train on a phrase-level task, you can feed this “C” output token into a task-specific head during fine-tuning.

 
This way, BERT works on both sentence- and word-level tasks without the need for different architectures.
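 
For the fine-tuning step itself, here is a short sketch of the two setups, again assuming Hugging Face Transformers (my assumption, not part of the paper): a token classification head applied to all individual token outputs, and a sequence classification head built on top of the “C” token. Model name and label counts are illustrative only.
 
```python
# Minimal sketch of the two fine-tuning setups, assuming Hugging Face
# Transformers; the checkpoint name and num_labels values are illustrative.
from transformers import (
    BertForTokenClassification,      # token-level tasks, e.g. named entity recognition
    BertForSequenceClassification,   # phrase/sentence-level tasks, e.g. sentiment analysis
)

# Token-level: a classification layer is applied to every token output.
token_model = BertForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=9
)

# Phrase-level: a classification layer is applied to the "C" ([CLS]) output.
sequence_model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
```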
 
At least, this is what was suggested in the BERT paper.
 
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
 
