BERT prepends a class token, or “[CLS]”, to every input sequence; at the output, it becomes the class output token, or “C”.
This class output token captures phrase-level information because it interacts with all the other input tokens through the attention mechanism in BERT’s encoder layers.
Of course, the individual output tokens capture token-level information.
When fine-tuning, you train the model to perform well on a particular task, which can be either a token-level or a phrase-level task.
- If you want to train on a token-level task, you can use the BERT architecture and work with all the individual token outputs during fine-tuning.
- If you want to train on a phrase-level task, you can use the “C” output token as the input for further fine-tuning.
This way, BERT handles both sentence- and word-level tasks without the need for different architectures.
At least, this is what was suggested in the BERT paper.
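The mechanism described above can be sketched in a few lines. This is a minimal pure-Python illustration, not BERT itself: it uses a single attention head with identity query/key/value projections and hypothetical 2-d toy embeddings, just to show that the [CLS] position's output is a mixture of information from every input position (phrase-level), while every other output position is its own mixture (token-level).

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(seq):
    """Single attention head with identity Q/K/V projections (a simplification).
    Each output position is an attention-weighted average of ALL input vectors."""
    d = len(seq[0])
    outs = []
    for q in seq:
        scores = [dot(q, k) / math.sqrt(d) for k in seq]  # scaled dot-product
        w = softmax(scores)                               # attention weights
        outs.append([sum(wi * v[j] for wi, v in zip(w, seq)) for j in range(d)])
    return outs

# Toy 2-d embeddings: position 0 plays the role of [CLS], the rest are words.
seq = [[0.0, 0.0],   # [CLS]
       [1.0, 0.5],   # "the"
       [0.2, 1.0],   # "movie"
       [0.9, 0.1]]   # "rocks"

outs = self_attention(seq)
cls_out = outs[0]      # "C": mixes information from every position -> phrase-level head
token_outs = outs[1:]  # per-token outputs -> token-level head (e.g. tagging)
```

Because the toy [CLS] embedding is the zero vector, its attention scores are all equal, so `cls_out` is simply the mean of all four input vectors; a real BERT learns much richer mixing weights, but the structural point is the same: fine-tuning attaches a classification head to `cls_out` for sentence tasks, or per-position heads to `token_outs` for token tasks.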
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.