Optimizing TensorFlow Training Time for Better Performance
Maximize your accelerator utilization to reduce training time, cut costs, and improve metrics across the board
With the right methods and some profiling, we can achieve significantly higher training throughput. This guide focuses on TensorFlow, but no worries: almost all of these optimizations and methods exist for PyTorch as well. Let’s begin.
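As a quick taste of the profiling mentioned above, here is a minimal sketch of how a training run can be profiled with the TensorFlow Profiler through the standard TensorBoard callback. The tiny model, synthetic dataset, log directory, and batch range are placeholders of my own, not the setup from the benchmark shown later.

```python
import tensorflow as tf

# Placeholder model and synthetic data -- swap in your own.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(10),
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal((1024, 32)),
     tf.random.uniform((1024,), maxval=10, dtype=tf.int64))
).batch(64)

# Profile batches 10-20 of the run; the trace shows up in TensorBoard
# under the "Profile" tab and reveals where the input pipeline or the
# device is the bottleneck.
tensorboard_cb = tf.keras.callbacks.TensorBoard(
    log_dir="logs/profile_run",   # assumed log directory
    profile_batch=(10, 20),
)

model.fit(dataset, epochs=1, callbacks=[tensorboard_cb])
```

Launching TensorBoard on the same log directory (`tensorboard --logdir logs`) then lets you inspect step times before and after each optimization.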
Numbers don’t lie. The following graph shows an example of what is possible: without optimization, training processes around 60 images per second, compared to almost 2,000 images per second when optimized. These optimizations can be applied to any kind of deep learning problem.
Think of this as a road you follow from left to right. You can’t simply run your training operations on your GPUs and TPUs in float16 and expect a speedup; you also need an efficient data pipeline to keep the accelerator fed. If you plan to optimize your training based on this guide, take the steps in order, from left to right.
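To make that concrete, here is a minimal sketch that combines the two ingredients just mentioned: float16 compute via Keras mixed precision and a tf.data pipeline with parallel preprocessing and prefetching. The synthetic data and toy model are placeholders I chose for illustration; the policy and pipeline calls are standard TensorFlow APIs.

```python
import tensorflow as tf

# Enable mixed precision globally: compute in float16, keep variables in float32.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# Synthetic stand-in data -- replace with your real images and labels.
images = tf.random.uniform((512, 64, 64, 3))
labels = tf.random.uniform((512,), maxval=10, dtype=tf.int64)

# An efficient tf.data pipeline keeps the accelerator fed once the math
# itself gets faster: parallel preprocessing, batching, and prefetching.
dataset = (
    tf.data.Dataset.from_tensor_slices((images, labels))
    .map(lambda x, y: (tf.image.resize(x, (32, 32)), y),
         num_parallel_calls=tf.data.AUTOTUNE)
    .batch(64)
    .prefetch(tf.data.AUTOTUNE)
)

# A small CNN; the final activation is kept in float32 for numerical
# stability, as recommended when training with mixed precision.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(32, 32, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10),
    tf.keras.layers.Activation("softmax", dtype="float32"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(dataset, epochs=1)
```

Either piece alone helps less than both together: fast float16 math on a starved GPU just means the GPU waits faster.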
We can optimize both training and inference. However, this article focuses solely on optimizing training time performance. If you are interested in how to optimize inference, let me know in the comments below or via a social media channel of your preference. Without further ado, let’s get started.