Optimizing TensorFlow Training Time for Better Performance

Maximize accelerator utilization to reduce training time, cut costs, and improve metrics across the board

Sascha Heyer
5 min read · Aug 20, 2021


By combining profiling with a handful of optimization methods, we can achieve significantly higher training throughput. This guide focuses on TensorFlow, but almost all of these optimizations also exist for PyTorch. Let’s begin.

Numbers don’t lie. The following graph is an example of what is possible. Without optimization, training processes around 60 images per second; fully optimized, it reaches almost 2,000 images per second. The same optimizations can be applied to any kind of deep learning problem.
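To put those two throughput numbers into perspective, here is a small back-of-the-envelope calculation. The dataset size of one million images is a hypothetical value I picked for illustration, and the throughputs are the rounded figures from the graph, so treat the result as a rough estimate.

```python
def epoch_seconds(num_images, images_per_second):
    """Time in seconds for one pass over the dataset at a given throughput."""
    return num_images / images_per_second


# Hypothetical dataset size, chosen only for illustration.
NUM_IMAGES = 1_000_000

baseline = epoch_seconds(NUM_IMAGES, 60)      # around 60 images/s, unoptimized
optimized = epoch_seconds(NUM_IMAGES, 2000)   # almost 2000 images/s, optimized
speedup = baseline / optimized

print(f"baseline:  {baseline / 3600:.1f} h per epoch")
print(f"optimized: {optimized / 60:.1f} min per epoch")
print(f"speedup:   {speedup:.0f}x")
```

Under these assumptions, an epoch shrinks from several hours to a few minutes, and accelerator cost shrinks roughly in proportion.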

Training optimization possibilities based on an image example — Author: Sascha Heyer

This is a road you have to follow from left to right. You can’t simply switch the training operations on your GPUs and TPUs to float16; you also need an efficient data pipeline to keep the accelerators fed. If you plan to optimize your training based on this guide, take the steps from left to right.
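As a rough idea of what these two steps can look like in code, here is a minimal sketch. It assumes TensorFlow 2.4 or newer and in-memory tensors as input; the function name, batch size, and shuffle buffer size are illustrative choices, not recommendations.

```python
import tensorflow as tf


def build_input_pipeline(images, labels, batch_size=32):
    """Build a tf.data pipeline that prepares batches while the model trains."""
    ds = tf.data.Dataset.from_tensor_slices((images, labels))
    ds = ds.shuffle(buffer_size=1024).batch(batch_size)
    # Overlap input preparation with training so the accelerator
    # does not sit idle waiting for the next batch.
    return ds.prefetch(tf.data.AUTOTUNE)


# Only after the input pipeline is in place: compute in float16
# while Keras keeps the model variables in float32.
tf.keras.mixed_precision.set_global_policy("mixed_float16")
```

The pipeline comes first on purpose. If the accelerator is already waiting for data, making its computations faster will not improve throughput much.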

We can optimize both training and inference. However, this article focuses solely on optimizing training time. If you are interested in how to optimize inference, let me know in the comments below or via a social media channel of your preference. Without further ado, let’s get started.

Identifying Training Time Performance Bottlenecks: Don’t Guess, Measure

You can use the TensorFlow Profiler (part of TensorBoard) to find bottlenecks, understand hardware resource consumption, and get the most out of your GPUs. The Profiler shows a timeline of every op, which helps you identify whether training is stalled waiting for data to be fetched.
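One common way to capture a profile during Keras training is the TensorBoard callback with its `profile_batch` argument. The following is a minimal sketch; the helper name, log directory, and step range are assumptions for illustration, and it assumes a TensorFlow version that accepts a tuple for `profile_batch`.

```python
import tensorflow as tf


def make_profiling_callback(log_dir="logs/profile", first_step=10, last_step=20):
    """Return a TensorBoard callback that profiles a range of training steps.

    Skipping the first few steps avoids measuring one-time startup work.
    """
    return tf.keras.callbacks.TensorBoard(
        log_dir=log_dir,
        profile_batch=(first_step, last_step),
    )
```

You would then pass it to training, for example `model.fit(dataset, epochs=1, callbacks=[make_profiling_callback()])`, and open the results with `tensorboard --logdir logs/profile`. Depending on your setup, the Profile tab may require installing the TensorBoard profiler plugin separately.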

The TensorFlow Profiler should be the first tool in your toolbox when it comes to saving costs and optimizing performance. We can also identify slow parts of the graph by analyzing the GPU streams. The Profiler can also be used to trace inference requests that take longer than expected.

I strongly recommend checking out the TensorFlow step-by-step guide on GPU performance analysis.

Efficient Data Pipelines




Hi, I am Sascha, Senior Machine Learning Engineer at @DoiT. Support me by becoming a Medium member 🙏 bit.ly/sascha-support