TensorFlow allows people to train their models in a distributed way. See https://www.tensorflow.org/guide/distributed_training. Can you explain simply what TensorFlow distribution strategies are? I don't fully understand them.
In large-scale deep learning settings, running your training job on a single GPU may no longer be feasible. For example, if the VRAM required exceeds what your GPU offers, even a state-of-the-art one, you will need multiple GPUs to distribute training across. What's more, sometimes you will need multiple machines, because even the combined GPUs of one machine don't suffice.
TensorFlow distribution strategies help you here by distributing training across multiple GPUs, multiple machines, and even TPUs. They allow you to distribute training without changing a significant amount of your code.
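To make this concrete, here is a minimal sketch of the most common case: `tf.distribute.MirroredStrategy`, which replicates your model across all GPUs on a single machine (and falls back to the CPU if none are available). The tiny `Dense` model and its shapes are illustrative choices, not anything prescribed by TensorFlow; the key point is that model creation simply happens inside `strategy.scope()`:

```python
import tensorflow as tf

# MirroredStrategy mirrors variables across all local GPUs and
# aggregates gradients automatically; with no GPU it runs on CPU.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Variables created inside the scope (e.g. these layer weights)
    # are replicated on every device the strategy manages.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(10, activation="relu", input_shape=(4,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="sgd", loss="mse")

# model.fit(...) is then called as usual; the strategy handles
# splitting each batch across replicas behind the scenes.
```

Note how the training code itself (`compile`, `fit`) is unchanged; only the strategy and the scope are new, which is what "without changing much of your code" means in practice.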
TensorFlow currently supports the following distribution strategies: