How is the CLS / C output in BERT used in NSP?

Chris Staff asked 2 months ago
1 Answer
Best Answer
Chris Staff answered 1 month ago

Recall that BERT prepends a special [CLS] token to every input sequence. Its final hidden state, denoted C in the paper, serves as an aggregate representation of the whole input, which lets BERT handle sentence-level tasks alongside token-level ones.
The NSP (next sentence prediction) objective teaches BERT to estimate whether the second of two sequences actually follows the first in the original text, or whether the pairing is unrelated (i.e. “Is this sentence the next sentence after reading this one? Yes/no”). This way, BERT can learn to distill sentence-level information through the attention mechanism during pretraining.
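To make the setup concrete, here is a minimal sketch of how one NSP training example could be assembled. In the paper, the second sentence is the true next sentence half of the time and a random one otherwise. The function name `make_nsp_example`, the whitespace tokenization (a stand-in for WordPiece), and drawing the random sentence from the same toy list are all assumptions made for illustration, not BERT's actual data pipeline.

```python
import random

def make_nsp_example(sentences, index, rng):
    """Pair sentences[index] with either its true successor or a random sentence.

    Labels follow the convention used in this answer:
    1 = next sentence, 0 = not next sentence.
    """
    first = sentences[index]
    if rng.random() < 0.5:
        # Positive example: the sentence that really comes next.
        second, label = sentences[index + 1], 1
    else:
        # Negative example: any sentence other than the first one or its successor.
        candidates = [s for i, s in enumerate(sentences) if i not in (index, index + 1)]
        second, label = rng.choice(candidates), 0
    # [CLS] opens the input; [SEP] closes each of the two segments.
    tokens = ["[CLS]"] + first.split() + ["[SEP]"] + second.split() + ["[SEP]"]
    return tokens, label

sentences = [
    "the man went to the store",
    "he bought a gallon of milk",
    "penguins are flightless birds",
    "the weather was cold today",
]
tokens, label = make_nsp_example(sentences, 0, random.Random(0))
print(tokens, label)
```

The position of [CLS] at the front is what matters here: its final hidden state is the C output that the NSP objective reads.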
Of course, NSP is trained by means of a loss, and this loss function must ‘latch onto’ some output of BERT. That output is C. From the paper: “C is used for next sentence prediction (NSP).” C is fed to a binary classifier, and its prediction is compared with the true class (0 for not next sentence, 1 for next sentence) using a standard classification loss such as cross-entropy.
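As a rough illustration of that last step, the sketch below applies a two-way linear classifier to a toy C vector and computes the cross-entropy against the true label. The function name `nsp_loss`, the hidden size of 4, and the weight values are made up for the example; real implementations use a much larger hidden size and may add extra layers between C and the classifier.

```python
import math

def nsp_loss(c, weights, bias, label):
    """Cross-entropy loss of a 2-class linear classifier applied to C.

    c:       final hidden state of the [CLS] token (list of floats)
    weights: one row of weights per class (2 rows)
    bias:    one bias per class
    label:   0 = not next sentence, 1 = next sentence
    """
    # One logit per class: dot product of the class weights with C, plus bias.
    logits = [sum(w * x for w, x in zip(row, c)) + b for row, b in zip(weights, bias)]
    # Numerically stable log-softmax.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    log_probs = [z - log_sum_exp for z in logits]
    probs = [math.exp(lp) for lp in log_probs]
    # Cross-entropy is the negative log-probability of the true class.
    return -log_probs[label], probs

c = [0.2, -0.1, 0.4, 0.3]                  # toy C vector (hypothetical values)
weights = [[0.5, -0.3, 0.1, 0.0],          # class 0: not next
           [-0.2, 0.4, 0.6, 0.3]]          # class 1: next
bias = [0.0, 0.0]
loss, probs = nsp_loss(c, weights, bias, label=1)
print(loss, probs)
```

During pretraining, the gradient of this loss flows back through C into the whole network, which is how the NSP signal reaches BERT's attention layers.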
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
