1 Answer
Best Answer
Generally, much better. Transformers can be trained with much larger architectures and on much larger datasets, which leads to large performance gains over LSTMs. For example, the largest current models have billions of parameters, compared to a few hundred thousand to a few million for typical LSTMs.
The reason Transformers perform so well compared to LSTMs is that they learn linguistic patterns differently. Whereas an LSTM processes the input sequence one step at a time, updating its memory on the fly through its forget, input, and output gates, a Transformer can process the whole sequence in parallel by means of self-attention.
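To make that contrast concrete, here is a minimal NumPy sketch of scaled dot-product self-attention. The shapes, weight matrices, and variable names are illustrative assumptions, not taken from any particular library; the point is that the attention scores between every pair of positions are computed with a single matrix multiplication, instead of stepping through the sequence token by token as an LSTM does.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over the whole sequence at once.

    x: (seq_len, d_model) input embeddings
    w_q, w_k, w_v: (d_model, d_k) projection matrices
    """
    q = x @ w_q                              # queries for all positions, in parallel
    k = x @ w_k                              # keys
    v = x @ w_v                              # values
    scores = q @ k.T / np.sqrt(k.shape[-1])  # (seq_len, seq_len) pairwise scores
    weights = softmax(scores, axis=-1)       # each position attends to every other
    return weights @ v                       # weighted sum of value vectors

# Illustrative example: a sequence of 5 tokens with 16-dim embeddings.
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (5, 8): no sequential loop over time steps is needed
```

Because there is no recurrence, all positions can be computed simultaneously on parallel hardware, which is what makes training on very large datasets practical.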
More information here.