Definitely. Self-attention is essentially a function capable of generating a supervision signal from unlabeled data. As a consequence, language models can learn language patterns from unlabeled data, a form of self-supervised learning that happens entirely in the pretraining stage. Because these patterns are general, we can subsequently fine-tune the pretrained models on labeled datasets to adapt them to specific language tasks.
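To make the self-supervision idea concrete, here is a minimal sketch of how a next-token prediction objective turns raw text into (input, label) pairs without any human annotation. The whitespace tokenizer and the example sentence are illustrative stand-ins; real models use subword tokenizers and far larger corpora.

```python
# A toy illustration of self-supervised label construction.
# Assumption: whitespace tokenization stands in for a real subword tokenizer.
text = "language models learn patterns from unlabeled text"
tokens = text.split()

# Next-token prediction: the label at each position is simply the token
# that follows it, so the raw text supplies its own supervision signal.
inputs = tokens[:-1]
targets = tokens[1:]

pairs = list(zip(inputs, targets))
for inp, tgt in pairs:
    print(f"{inp!r} -> {tgt!r}")
```

Each pair is a training example the model can learn from, which is why no labeled dataset is needed until the fine-tuning stage.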