Chris Staff asked 11 months ago
Chris Staff answered 11 months ago

Suppose that we have a loss function $$L_{ft}(C)$$ which describes the loss of the fine-tuning task (in the case of binary classification, for example, this can be binary cross-entropy loss).

Fine-tuning happens on a model that was first pretrained on an unlabeled dataset, and hence in an unsupervised fashion, using some loss function $$L_u(C)$$. This loss function is a language modeling loss.

Radford et al. (2018) show that performance of the fine-tuning operation improves even further in some cases (on specific fine-tuning tasks, and with large fine-tuning datasets) when $$L_u(C)$$ is added as an auxiliary objective to $$L_{ft}(C)$$:

$$L_{combined}(C) = L_{ft}(C) + \lambda \times L_{u}(C)$$

Here, $$\lambda$$ serves as a weight for the added language modeling loss.
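In code, the combination is just a weighted sum of the two scalar losses. A minimal sketch (the function name and the example loss values are illustrative; Radford et al. (2018) set $$\lambda = 0.5$$ in their experiments):

```python
def combined_loss(loss_ft, loss_u, lam=0.5):
    """Combined fine-tuning objective: L_combined = L_ft + lambda * L_u.

    loss_ft: scalar loss of the fine-tuning task (e.g. binary cross-entropy)
    loss_u:  scalar language modeling loss (the pretraining objective)
    lam:     weight for the auxiliary language modeling term (lambda)
    """
    return loss_ft + lam * loss_u

# Illustrative values: a classification loss of 1.0 and an LM loss of 2.0
# give a combined loss of 1.0 + 0.5 * 2.0 = 2.0.
print(combined_loss(1.0, 2.0))
```

During training, this combined scalar is what you would backpropagate through, so gradients flow from both the task head and the language modeling head into the shared Transformer.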

Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training.