What is the T5 Transformer and how does it work?

Chris Staff asked 11 months ago
1 Answers
Best Answer
Chris Staff answered 11 months ago

The Text-to-Text Transfer Transformer, or T5, is a Transformer that can be trained on a variety of tasks with a single, uniform architecture. It was created by Google AI and introduced in the paper “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer”. Here, we’ll take a look at the T5 architecture, pretraining, and finetuning, including the variations that were studied and the conclusions that can be drawn from them. In effect, this answer summarizes that paper.
Goal of creating T5:
* Can we create ONE uniform architecture with which we can learn MULTIPLE tasks, instead of using a task-specific architecture for varying types of language tasks?
T5 architecture:
* T5 uses a regular Vaswani et al. Transformer with a few exceptions:
* Its LayerNorm is simplified: activations are only rescaled, and no additive bias is applied.
* Its LayerNorm is outside the residual path.
* Dropout is added to the FeedForward segment, residual connection, attention subsegment, and I/O of the whole stack.
* Relative position encoding is applied instead of sine-based encoding. Unlike the sinusoidal encoding, which assigns an embedding to each absolute position, it produces an embedding based on the offset between tokens A and B in the attention mechanism. In other words, the same pair of tokens receives a different embedding when the distance between them changes. Each embedding is simply a scalar that is added to the corresponding attention logit (see the paper for more details).
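The two less familiar changes above, the simplified LayerNorm and the scalar relative position bias, can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: it assumes the rescaling is done by the root mean square of the activations (as in common T5 implementations), and it uses a simple clipped offset instead of the bucketing scheme described in the paper. The names `simplified_layer_norm`, `relative_bucket`, and `bias_table` are hypothetical.

```python
import math

def simplified_layer_norm(x, scale, eps=1e-6):
    """Rescale activations by their root mean square; no mean subtraction, no bias.

    Assumption: RMS-based rescaling, as used in common T5 implementations.
    """
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [s * v * inv_rms for s, v in zip(scale, x)]

def relative_bucket(offset, max_distance=2):
    """Clip the key-minus-query offset and shift it to a 0-based table index.

    Simplification: the paper uses a more elaborate bucketing of offsets.
    """
    clipped = max(-max_distance, min(max_distance, offset))
    return clipped + max_distance

def add_relative_position_bias(logits, bias_table, max_distance=2):
    """Add one learned scalar per relative offset to each attention logit."""
    biased = []
    for q, row in enumerate(logits):
        biased.append([
            logit + bias_table[relative_bucket(k - q, max_distance)]
            for k, logit in enumerate(row)
        ])
    return biased

# Hypothetical learned scalars for offsets -2, -1, 0, +1, +2.
bias_table = [-0.2, -0.1, 0.0, 0.1, 0.2]
logits = [[0.0] * 4 for _ in range(4)]  # 4 queries x 4 keys
biased = add_relative_position_bias(logits, bias_table)
normed = simplified_layer_norm([3.0, 4.0], scale=[1.0, 1.0])
```

Note that `biased[0][1]` and `biased[1][2]` receive the same scalar, because both query/key pairs are one position apart: the bias depends on the offset, not on the absolute positions.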
T5 pretraining (dataset / method):
* T5 is pretrained in parallel (using both model and data parallelism) on Cloud TPU Pods, i.e. multi-rack supercomputers containing 1,024 TPU chips connected through a high-speed 2D mesh interconnect, with supporting CPU hosts. Mesh TensorFlow was used for this purpose.
* T5 leveraged the Common Crawl dataset, which consists of scraped web text with non-text content stripped from the HTML. Common Crawl produces around 20TB of text each month. Heuristics defined as suitable by previous research were applied to clean this dataset, generating the Colossal Clean Crawled Corpus or C4. These were the heuristics used when pretraining T5:
* (1) Only include lines ending with a terminal punctuation mark (a period, exclamation mark, question mark, or closing quotation mark).
* (2) Pages are discarded if they contain < 5 sentences, and lines are only retained if they contain >= 3 words.
* (3) Pages are discarded if they contain any word that is present on the “List of Dirty, Naughty, Obscene or Otherwise Bad Words”.
* (4) Any line containing the word “Javascript” is removed, because many scraped pages contain warnings stating that Javascript should be enabled.
* (5) Any page with a line containing the words “Lorem ipsum” is removed, because it might be the typical default text.
* (6) Any page containing curly brackets (i.e. { ) is removed, because it might be code.
* (7) Dataset deduplication occurs by removing all but one of any three-sentence spans occurring more than once (in other works, the Gettysburg Address was present multiple times, leading the model to memorize the text rather than learn to generate it).
* (8) Any page that is not classified as English with a probability of at least 0.99 is discarded.
* C4 was created because other datasets were either not available, not of sufficient scope, or were not filtered enough.
* The C4 dataset has a size of approximately 750GB of reasonably clean and natural text.
* The C4 dataset is released as part of the TensorFlow Datasets.
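To make the cleaning heuristics above more concrete, here is a rough sketch of a page filter covering heuristics (1) to (6). It is not the code used to build C4: `BAD_WORDS` is a hypothetical one-word stand-in for the real list, sentences are counted crudely by runs of terminal punctuation, the Javascript rule is applied per line (an assumption about granularity), and deduplication and language detection are left out.

```python
import re

# Terminal punctuation accepted at the end of a line (heuristic 1).
TERMINAL_PUNCTUATION = (".", "!", "?", '"', "\u201d")
# Hypothetical stand-in for the real bad-words list (heuristic 3).
BAD_WORDS = {"badword"}

def clean_page(page, min_sentences=5, min_words=3):
    """Return the cleaned page text, or None if the page is discarded."""
    lowered = page.lower()
    # Heuristics 5 and 6: placeholder text or curly brackets remove the page.
    if "lorem ipsum" in lowered or "{" in page:
        return None
    # Heuristic 3: any listed word removes the page.
    if set(re.findall(r"[a-z']+", lowered)) & BAD_WORDS:
        return None
    kept = []
    for line in page.splitlines():
        line = line.strip()
        if not line.endswith(TERMINAL_PUNCTUATION):  # heuristic 1
            continue
        if len(line.split()) < min_words:            # heuristic 2 (line part)
            continue
        if "javascript" in line.lower():             # heuristic 4
            continue
        kept.append(line)
    text = "\n".join(kept)
    # Heuristic 2 (page part): crude sentence count, an assumption of this sketch.
    if len(re.findall(r"[.!?]+", text)) < min_sentences:
        return None
    return text

page = (
    "Home\n"
    "T5 is a model. It reads text. It writes text.\n"
    "Please enable Javascript to continue.\n"
    "It was trained on C4. It works well.\n"
    "short one."
)
cleaned = clean_page(page)
```

For this sample page, the navigation line, the Javascript warning, and the two-word line are dropped, and the remaining two lines together contain five sentences, so the page is kept.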
T5 finetuning (downstream tasks):
* T5 is finetuned on a variety of tasks from the GLUE and SuperGLUE benchmarks, as well as the CNN/DM dataset for summarization, the SQuAD dataset for question answering, and the WMT EN-FR, EN-DE, and EN-RO datasets for translation.
* These are the GLUE/SuperGLUE tasks used for finetuning:
* Sentence acceptability judgement (CoLA)
* Sentiment analysis (SST-2)
* Paraphrasing / sentence similarity (MRPC; STS-B; QQP)
* Natural language inference (MNLI, QNLI, RTE, CB)
* Coreference resolution (WNLI, WSC)
* Sentence completion (COPA)
* Word sense disambiguation (WiC)
* Question answering (MultiRC, ReCoRD; BoolQ)
T5 finetuning (text formatting / text structure):
* Finetuning treats all GLUE and SuperGLUE tasks as a single task by concatenating all datasets together.
* Training a single Seq2Seq architecture on a variety of tasks requires that the tasks themselves are formatted in line with the architecture, rather than the architecture in line with the tasks. The authors call this format “text-to-text”.
* To specify the task that the model should perform, a task-specific prefix is added before feeding the text to the model. For example: “translate English to French: I go to the bakery” would need to produce “Je vais à la boulangerie”. More examples can be found in the paper.
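The prefixing described above amounts to simple string formatting. The sketch below uses a hypothetical helper, `to_text_to_text`, and borrows two input/target pairs from the examples shown in the paper; note that even a classification label is emitted as plain text.

```python
def to_text_to_text(prefix, text):
    """Prepend the task-specific prefix so one model can serve every task."""
    return f"{prefix}: {text}"

# (input text, target text) pairs; both sides are always plain strings.
examples = [
    (to_text_to_text("translate English to German", "That is good."),
     "Das ist gut."),
    (to_text_to_text("cola sentence", "The course is jumping well."),
     "not acceptable"),
]

for source, target in examples:
    print(f"{source!r} -> {target!r}")
```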

T5 finetuning (process):
* T5 is finetuned with a maximum likelihood objective (teacher forcing) regardless of the task.
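As a rough illustration of what maximum likelihood with teacher forcing means, the sketch below feeds the decoder the ground-truth previous token at each step and averages the negative log-likelihood of the target tokens. The helper names and the start token id of 0 are assumptions made for this example, and the logits are made up rather than produced by a model.

```python
import math

def shift_right(target_ids, start_id=0):
    """Teacher forcing: the decoder input at step t is the true token from step t-1."""
    return [start_id] + target_ids[:-1]

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def teacher_forced_nll(step_logits, target_ids):
    """Mean negative log-likelihood (cross-entropy) of the target tokens."""
    total = 0.0
    for logits, target in zip(step_logits, target_ids):
        total -= log_softmax(logits)[target]
    return total / len(target_ids)

target_ids = [2, 1, 3]
decoder_inputs = shift_right(target_ids)          # what the decoder is fed
uniform_logits = [[0.0] * 4 for _ in target_ids]  # made-up logits, vocabulary of 4
loss = teacher_forced_nll(uniform_logits, target_ids)
```

With uniform logits over a vocabulary of four tokens, the loss equals log(4), which is the value an untrained model would start from in this toy setting.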
Experimental variations for studying T5 pretraining/finetuning:
* As a baseline model/procedure, the model is configured so that the encoder and decoder are each similar to BERT-base in terms of number of segments, hidden dimensionality, et cetera. The baseline further used cross-entropy loss (given the teacher forcing objective) and AdaFactor optimization.
* Pretraining was performed for 2^19 steps before finetuning started. Each sequence was 512 tokens long at maximum, and batch size was set to 128 sequences. An inverse-square-root learning rate schedule, 1/sqrt(max(n, k)), was used, where n is the current iteration and k is the number of warmup steps. In other words, by setting k to 10^4 the LR was 0.01 for the first 10^4 steps, after which it decays proportionally to 1/sqrt(n).
* The baseline model was finetuned for 2^18 steps (with a constant LR of 0.001 and batches of 128 sequences of length 512). A checkpoint was saved every 5,000 steps. SentencePiece was used to tokenize the input.
* Here are the variations that were evaluated:
* (1) Alternative architectures: besides the classic Seq2Seq model, a language model and prefix language model architecture were also used.
* (2) Alternative unsupervised objectives: prefix LM, BERT-style, deshuffling, MASS-style, i.i.d. noise replace spans/drop tokens, random spans.
* (3) Alternative pretraining datasets: C4, Unfiltered C4, RealNews-like, Webtext-like, Wikipedia, Wikipedia + Toronto Books Corpus.
* (4) Alternative training strategy: updating all weights during finetuning vs more efficient approaches.
* (5) Alternative scaling policy: studying the impact of more data vs. a larger model vs. both, a discussion that is prominent within the Transformer field these days.
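The baseline numbers above are easy to check with a few lines of arithmetic. The sketch below evaluates the inverse-square-root schedule 1/sqrt(max(n, k)) with k = 10^4 and counts how many tokens 2^19 steps of 128 sequences of 512 tokens amount to, assuming every sequence is filled to its maximum length. The function name is hypothetical.

```python
import math

def inverse_sqrt_lr(step, warmup_steps=10_000):
    """Learning rate 1 / sqrt(max(n, k)) for iteration n and warmup length k."""
    return 1.0 / math.sqrt(max(step, warmup_steps))

lr_start = inverse_sqrt_lr(1)        # constant while n <= k
lr_at_k = inverse_sqrt_lr(10_000)    # last step of the constant phase
lr_later = inverse_sqrt_lr(40_000)   # 4x the steps gives half the rate

pretrain_steps = 2 ** 19
# Upper bound on tokens seen: steps x batch size x maximum sequence length.
tokens_seen = pretrain_steps * 128 * 512
```

The rate stays at 0.01 throughout the warmup phase and has only halved by step 40,000, so the decay is far slower than an exponential one. The token count works out to 2^35, roughly 34 billion tokens.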
Experimental findings:
* Text-to-text based Seq2seq learning is an easy way to generate a multitask model across task types (generation, classification, pseudo-regression). It shows comparable performance to task-specific architectures.
* Alternative architectures: the original Transformer encoder-decoder works best in the text-to-text setting. The additional cost of such a heavy architecture can be offset by efficiency gains such as sharing parameters between the encoder and decoder.
* Unsupervised objectives: there is no clear winner among the denoising-style objectives mentioned above, and thus the authors advise using an objective that produces short target sequences so that training is more efficient.
* Alternative pretraining datasets: C4 wins vs the unfiltered version, while pretraining on in-domain data helps for the more domain-specific tasks. However, another finding suggests that when the dataset gets too narrow (niche-like), you risk that it gets too small, leaving performance to suffer.
* Alternative training strategies: updating all weights during finetuning trumps more efficient approaches despite the cost. Training on multiple tasks at once did not yield performance boosts. Pretraining on a mixture of tasks and then finetuning performed comparably to unsupervised pretraining followed by finetuning.
* Scaling: more data, a larger model, and ensembling your models are all techniques that improve results. However, larger models trained for fewer steps tend to outperform smaller models trained on more data. Ensembles improve results most IF the models are pretrained AND finetuned separately; ensembles finetuned from the same pretrained model perform worse, although they still beat a single model.
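To illustrate why a span-replacing denoising objective yields short target sequences, the sketch below corrupts a sentence by swapping chosen spans for sentinel tokens and building the matching target. It is a simplified illustration: the spans are fixed by hand instead of sampled at random, and the sentinel naming `<extra_id_N>` is an assumption borrowed from common T5 tokenizers rather than something stated in this answer.

```python
def corrupt_spans(tokens, spans):
    """Replace each (start, end) span with a sentinel token.

    Returns (inputs, targets): inputs keep the uncorrupted tokens plus sentinels,
    targets list each sentinel followed by the tokens it replaced.
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{i}>"  # assumed sentinel naming
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinel)
        targets.append(sentinel)
        targets.extend(tokens[start:end])
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # final sentinel closes the target
    return inputs, targets

tokens = "Thank you for inviting me to your party last week .".split()
inputs, targets = corrupt_spans(tokens, [(2, 4), (8, 9)])
print(" ".join(inputs))
print(" ".join(targets))
```

Here the target holds only the three dropped tokens plus sentinels, which is shorter than the original eleven-token sentence; predicting the full uncorrupted text instead would make every training step more expensive.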
T5 conclusions:
* Formulating tasks as text-to-text works as a way of generalizing a single Transformer architecture across language tasks.
* Still, we face the inconvenience of large language models in settings where smaller / less expensive models are required. This is also true for settings where resources are low (i.e. where a smaller, custom dataset needs to be used for finetuning). The field requires methods for stronger performance with cheaper models. Some work is being done on this through distillation, parameter sharing, and conditional computation.
* More efficient knowledge extraction might be necessary (instead of denoising corrupted text). This might avoid the need for training on 1 trillion tokens. Some work in this direction involves pretraining to distinguish between real and machine-generated text.
* Formalizing task similarity can help us understand which unlabeled data must be used for a specific task (given the Alternative pretraining datasets findings from above).
* Language-agnostic models can greatly boost the field, i.e. models that can be used regardless of language.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., … & Liu, P. J. (2019). Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
