15 percent of the tokens are selected at random for prediction. These selected tokens are then handled as follows:
- 80 percent of the selected tokens are replaced with a [MASK] token.
- 10 percent of the selected tokens are replaced with another token picked at random from the vocabulary.
- 10 percent of the selected tokens are left unchanged, though the model must still predict them.
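The selection-and-replacement procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation: the function name, the toy vocabulary, and the use of whole-word strings instead of subword IDs are all simplifications for clarity.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """BERT-style masking sketch: select ~15% of positions, then
    replace 80% of those with [MASK], 10% with a random token,
    and leave 10% unchanged. Returns (masked_tokens, labels),
    where labels holds the original token at selected positions
    and None elsewhere."""
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)  # None = position not selected for prediction
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok  # the model must recover the original token here
            r = rng.random()
            if r < 0.8:
                masked[i] = MASK_TOKEN         # 80%: replace with [MASK]
            elif r < 0.9:
                masked[i] = rng.choice(vocab)  # 10%: replace with a random token
            # else: 10% keep the original token unchanged
    return masked, labels
```

Keeping 10 percent of selected tokens unchanged prevents the model from assuming that every token it must predict is corrupted, which narrows the mismatch between pre-training (where [MASK] appears) and fine-tuning (where it never does).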
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.