[MUSIC PLAYING]
ZONGWEI ZHOU: Good morning and good afternoon, researchers
and TensorFlow developers.
My name is Zongwei Zhou.
I'm from the TensorFlow performance team.
Today, I'm going to tell you more about TensorFlow
distributed training.
Let me start the talk by sharing a piece of user experience
that you might have already encountered before.
Suppose that you have an awesome prototype machine learning
model that you worked so hard to make
run efficiently on a single host with multiple GPUs.
Now, it's time to really get it running end
to end with some more resources.
Then you start up four of your beefy cloud
virtual machines, each with multiple modern GPUs,
connected by a fancy 100-gigabit network.
With all these resources, you deploy your models
and hope to see it run blazingly fast.
But wait, why am I only getting 1.5X faster than a single
machine while I am using four of them?
Yes, I know.
This is really frustrating.
If you have similar experience and questions,
this talk is for you.
In today's talk, you are going to learn
how to scale out your TensorFlow Keras model
to multiple machines and multiple GPUs.
And using the optimizations we will ship
in the upcoming TensorFlow 2.2 release,
you are going to dramatically improve
your training throughput.
I use one case study, BERT SQuAD, for today's talk.
BERT, of course, is the revolutionary NLP
model that has been very popular in the past year.
I use the fine-tuning task, which
is the SQuAD reading comprehension task,
as an example.
Your training model is going to be given some reading materials
and questions.
And your model needs to answer the questions.
And then the answers can be evaluated for accuracy.
The model we use today is available in the TensorFlow 2 Official Model
Garden, which you can download via this link
and try out by following along.
And here is a quick demonstration of the BERT SQuAD
training throughput.
Starting from the first line, this
is the training throughput using TensorFlow
2.1 out of the box.
With three optimizations in TensorFlow 2.2
that I'm going to introduce today,
the model is now running about 2.5X faster.
Aren't you excited and want to see these optimizations?
But before diving into these exciting optimizations,
let me provide some background information to introduce
how the model is synchronously trained on multiple hosts
and multiple GPUs.
And here, we are leveraging the native TensorFlow distributed
training support, which is MultiWorkerMirroredStrategy.
In the figure here, we have two GPU devices
on two different hosts.
And we are using a simple deep learning
model, which has two layers, A and B, with one variable
each.
So each GPU receives a subset of the training data
and computes the forward pass using its local copy
of the model variables.
And then, it runs through the backward pass
to calculate the gradients of each layer.
After the gradients are calculated,
all the devices start communicating
among themselves using an allreduce algorithm
to aggregate the gradients.
After the gradient aggregation, each device
gets back the same set of aggregated gradients
and then uses that to update its local variables.
We call it synchronous because every device
needs to aggregate the gradients, get
the same set of aggregated gradients,
and update its variables before it can actually
proceed to the forward pass of the next training step.
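To make this concrete, here is a minimal sketch of how such synchronous
multi-worker training can be set up in TensorFlow 2.2; the two-layer toy
model and the NCCL communication choice are illustrative, and each worker
is assumed to have its TF_CONFIG environment variable describing the cluster.

import tensorflow as tf

# Each worker is assumed to have TF_CONFIG set (worker addresses and this
# worker's index) before this runs.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
    communication=tf.distribute.experimental.CollectiveCommunication.NCCL)

with strategy.scope():
    # A toy two-layer model standing in for "Layer A" and "Layer B";
    # variables created here are mirrored on every GPU of every worker.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),  # Layer A
        tf.keras.layers.Dense(10),                      # Layer B
    ])
    optimizer = tf.keras.optimizers.Adam()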
So with so many phases in deep learning training,
how do we actually identify where the bottleneck is?
Is it in the forward pass, the backward pass, gradient aggregation,
or variable updates? How do we know where to optimize?
Yes, of course, we are going to use
the TensorFlow Profiler, which Quimin just
introduced to you today.
So let's start a run of the BERT SQuAD model,
log in to one VM, fire up TensorBoard, take a profile,
and then open the TraceViewer.
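In case you want to capture the profile programmatically rather than from
the TensorBoard UI, a minimal sketch with the TF 2.2 profiler API could look
like this; the log directory, train_dataset, and train_step here are
placeholders for your own training loop.

import tensorflow as tf

# Profile a handful of training steps; the logdir is a placeholder path.
tf.profiler.experimental.start("/tmp/bert_squad_logs")
for batch in train_dataset.take(10):   # train_dataset: your tf.data pipeline
    train_step(batch)                  # train_step: your existing step function
tf.profiler.experimental.stop()
# Then point TensorBoard at the same logdir and open the trace viewer.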
And here is the trace you will get from the TraceViewer.
In this view, you see there's a pink bar
with number five at the bottom of the profile, which
means that this is training step number five
in your profile.
And these are all the operations within this training step.
And the first five rows are events that are related to the GPU.
Specifically, in the first row, you
will see the GPU computation, which
includes the forward pass, backward pass, and variable updates.
And you can also see a big blue bar
called NCCL allreduce kernel, which
is the gradient aggregation.
It's called NCCL because almost every machine learning framework,
including TensorFlow, uses the NVIDIA NCCL allreduce
library to aggregate gradients.
So by now, you may have already spotted the problem, right?
This is exactly the power of the TensorFlow Profiler, which makes
it so visual and obvious.
So the problem we are facing here
is that the gradient aggregation can actually
dominate the entire step time.
All you see is this big blue box of NCCL allreduce.
So how do we resolve this issue and optimize the training
performance?
Here come the three optimizations.
For the first optimization: from the profiler,
you can get the information regarding the total time
used in NCCL allreduce.
And since you know your model, you know the total size
of the model variables and gradients.
And from these pieces of information,
you can calculate your actual NCCL allreduce throughput.
And you also know how many machines you use
and how many GPUs each machine has.
Using this information, and following the NVIDIA NCCL
tuning guidelines, you can calculate
the ideal NCCL throughput.
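As a rough sketch of that arithmetic, with purely hypothetical numbers
standing in for the gradient size, the allreduce time, and the GPU count:

# Hypothetical numbers for illustration only.
grad_bytes = 1.3e9        # total gradient bytes aggregated per step (FP32)
allreduce_time_s = 0.45   # NCCL allreduce time per step, read from the profiler
num_ranks = 8             # e.g. 4 hosts x 2 GPUs

# Measured algorithm bandwidth, and the "bus bandwidth" used by NVIDIA's
# tuning guideline: a ring allreduce moves 2 * (n - 1) / n of the data per rank.
alg_bw = grad_bytes / allreduce_time_s
bus_bw = alg_bw * 2 * (num_ranks - 1) / num_ranks
print(f"measured bus bandwidth: {bus_bw / 1e9:.1f} GB/s")
# A 100-gigabit network tops out around 12.5 GB/s, so a measured value far
# below that suggests NCCL is not yet saturating the network.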
And in this case, we found out that the real NCCL allreduce
throughput is actually much smaller
than the ideal throughput.
Then, we know that the first optimization
is to actually tune your NCCL to fully utilize
the underlying cross-host network
to improve the performance.
For the second optimization, if you notice the big blue bar here,
it says f32, which means that the gradient aggregation is
in the full-precision float32 format.
And we know that the model can be trained
with mixed precision, so the gradients
can actually be in a lower precision like float16.
So can we actually aggregate the gradients
in float16, which would effectively cut the network
data that is being transferred in half and thus
improve the performance?
For the third one, if you noticed, the NCCL
allreduce exchanges data across the network,
and it doesn't use the GPU itself very much.
So you see an empty space above the blue bar.
So can we actually push the NCCL allreduce forward
a little bit, so that it overlaps with the GPU computation
happening in the backward pass?
This way we can reduce the GPU idle time
and improve the training step time.
So here are the three ideas, and they all look good.
And I'm going to show how to implement
these ideas using TensorFlow 2.2.
The first optimization is NCCL throughput tuning.
So TensorFlow 2.2 ships with the latest NVIDIA NCCL
libraries.
And we have run a lot of experiments on Google Cloud VMs
to identify a set of recommended parameters
to help you reach the peak throughput.
Users can prepend those parameters
when running their models, such as the NCCL_SOCKET_NTHREADS
parameter here, set before the model main.py command.
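As a hedged sketch, you could equivalently set the parameter from inside
your Python launcher before TensorFlow creates its collectives; the value
here is purely illustrative, not a recommendation for your network.

import os

# Illustrative value only; tune it for your own environment, or prefix the
# launch command instead, e.g. NCCL_SOCKET_NTHREADS=8 python model_main.py ...
os.environ["NCCL_SOCKET_NTHREADS"] = "8"

import tensorflow as tf  # NCCL reads this environment variable when the
                         # collective communicator is first created.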
So if users have a different network environment
than the Cloud VMs, they might need
to run experiments to find the optimum parameters.
This is not ideal, and we are looking to improve it.
We are working with NVIDIA to see if we can autotune the NCCL
parameters.
So in a future TensorFlow release, users
won't need to manually test and set any of these parameters.
For now, with the optimal NCCL parameters,
we see a dramatic 30% to 40% throughput improvement
in BERT SQuAD.
That's pretty good for just one optimization.
So are you excited to see the second optimization?
OK, here it comes.
If you remember, we were looking to aggregate the gradients
in lower precision, which is float16.
The prerequisite is that the model is already trained
with the Keras mixed precision API.
And we have an online tutorial, at this link,
that covers the technical details and usage of this API.
Generally speaking, mixed precision
uses two types of floating point number representation
during training, float32 and float16.
As you can see from the figure, float32, with more bits,
can represent a larger range of numbers with higher precision
than float16.
But float16 has its own advantage.
Computing in float16 on modern GPUs can be up to 2 to 4X
faster than the same FP32 computation.
So this is really a trade-off between model accuracy
and computing efficiency.
But the mixed precision API actually tries to give you
the best of both worlds.
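Enabling it is just a couple of lines with the TF 2.2 experimental Keras
API; here is a minimal sketch, with the optimizer choice as an example only.

import tensorflow as tf

# Run layer computations in float16 while keeping variables in float32.
policy = tf.keras.mixed_precision.experimental.Policy("mixed_float16")
tf.keras.mixed_precision.experimental.set_policy(policy)

# Wrap the optimizer with loss scaling so small float16 gradients don't underflow.
optimizer = tf.keras.mixed_precision.experimental.LossScaleOptimizer(
    tf.keras.optimizers.Adam(), loss_scale="dynamic")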
To show how mixed precision works under the hood,
I will show the BERT SQuAD custom training loop code.
With this code, users get maximum flexibility
to change every aspect of the training.
And here is the training loop code in TensorFlow 2.1.
With mixed precision, your model variables
are still kept in float32 for the best model accuracy.
But the computation, including the gradient computation here,
is all done in float16.
After the gradient computation,
the gradients are converted back to FP32,
and then the gradient updates are applied
to the FP32 model variables.
The apply_gradients API here
implicitly aggregates the FP32 gradients for you.
This is why you saw the gradients aggregated in FP32
in the previous profile.
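In other words, the TF 2.1 pattern looks roughly like the sketch below;
this is not the exact Model Garden code, and model, loss_fn, optimizer, and
strategy are assumed to be defined as in the earlier sketches.

@tf.function
def train_step(features, labels):
    def step_fn(features, labels):
        with tf.GradientTape() as tape:
            logits = model(features, training=True)       # runs in float16
            loss = loss_fn(labels, logits)
            scaled_loss = optimizer.get_scaled_loss(loss)  # loss scaling
        scaled_grads = tape.gradient(scaled_loss, model.trainable_variables)
        grads = optimizer.get_unscaled_gradients(scaled_grads)  # FP32 gradients
        # apply_gradients implicitly all-reduces these FP32 gradients across
        # every replica before updating the FP32 variables.
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss

    # strategy.run is called experimental_run_v2 in TF 2.1.
    return strategy.run(step_fn, args=(features, labels))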
With TensorFlow 2.2, you can now change the custom training loop
code a bit to enable gradient aggregation in float16.
First, we modified the optimizer apply_gradients API.
We added a new argument called all_reduce_sum_gradients.
When you set this argument to False,
it essentially tells the optimizer API
not to do the gradient aggregation for you.
Instead, you can manually call the gradient aggregation API
from the distribution strategy to do the allreduce yourself.
And the best thing is that you can apply any customized
gradient operations before and after the allreduce
using this manual allreduce, including
what we want to do for gradient aggregation in float16.
So you simply need to cast the gradients to FP16
before the allreduce, do the gradient allreduce,
and then cast the gradients back to FP32.
So this is as simple as what you need to do.
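Put together, the inner step function could look something like this
sketch; the argument name all_reduce_sum_gradients is as described in this
talk (later releases rename it experimental_aggregate_gradients), and the
other names are assumptions carried over from the previous sketch.

def step_fn(features, labels):
    with tf.GradientTape() as tape:
        loss = loss_fn(labels, model(features, training=True))
        scaled_loss = optimizer.get_scaled_loss(loss)
    scaled_grads = tape.gradient(scaled_loss, model.trainable_variables)
    grads = optimizer.get_unscaled_gradients(scaled_grads)

    # Cast to FP16, all-reduce across replicas, then cast back to FP32.
    fp16_grads = [tf.cast(g, tf.float16) for g in grads]
    reduced = tf.distribute.get_replica_context().all_reduce(
        tf.distribute.ReduceOp.SUM, fp16_grads)
    fp32_grads = [tf.cast(g, tf.float32) for g in reduced]

    # Tell the optimizer not to aggregate again, since we already did it.
    optimizer.apply_gradients(
        zip(fp32_grads, model.trainable_variables),
        all_reduce_sum_gradients=False)
    return loss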
And as you can see, there are quite a few gradient
casts here back and forth between FP32 and FP16, which is OK,
because the TensorFlow graph optimization will actually
remove these redundant casts under the hood.
So you are only left with one gradient
cast from FP16 to FP32, which you have to do anyway,
because you are applying the gradients to the FP32 model
variables.
So the casts here are mostly for your code readability
and will not degrade your performance.
Also, I want to make one more point.
Advanced users who want to customize gradient operations,
including allreducing in FP16, can
use the custom training loop to get maximum flexibility.
But for the average user who just wants allreduce in FP16 out
of the box, we are working on supporting
this in Keras Compile and Fit in upcoming releases.
So in a future release, you will be able to
get allreduce in FP16 out of the box using the Keras Compile and Fit
API.
So with FP16 allreduce, we are now
seeing that the BERT SQuAD training throughput is further
increased by about 35%.
We are now at almost 2.2X the throughput compared with
TensorFlow 2.1 so far.
But wait, we've still got one more optimization.
For the third optimization, if you still remember,
we want to reduce the GPU idle time by overlapping gradient
aggregation with the backward computation on the GPU.
Let's take a look at the deep learning model
figure we saw previously.
So in this model, after TensorFlow
calculates the gradients of Layer B,
we can immediately send out the gradients of Layer B
for aggregation.
And at the same time, on your GPU,
you can still calculate the gradients of Layer A.
So as you can see, the network communication
of allreducing the gradients of Layer B
now runs in parallel with the gradient calculation
of Layer A, so we can use both the GPU and the network resources
more efficiently.
To enable this overlap, let's make some more changes
to the same custom training loop code from the previous optimization.
So in TensorFlow 2.2, we introduce collective hints
to the distribution strategy allreduce API
by passing a bytes_per_pack argument to the allreduce API.
It tells TensorFlow to break your model gradients
into multiple packs.
And the TensorFlow runtime will actually
send out a pack as soon as that pack of gradients
is available in the backward computation.
So this is as easy as it looks, just
give it a bytes_per_pack number, right?
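In code, that could look roughly like the snippet below, reusing the manual
allreduce from the earlier sketch; the 32 MB pack size is a placeholder, and
tf.distribute.experimental.CollectiveHints is the TF 2.2 experimental name
for these hints.

# Placeholder pack size; find the right value with the NCCL benchmarks below.
hints = tf.distribute.experimental.CollectiveHints(bytes_per_pack=32 * 1024 * 1024)

reduced = tf.distribute.get_replica_context().all_reduce(
    tf.distribute.ReduceOp.SUM, fp16_grads, experimental_hints=hints)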
But the next question would be, what
is the optimum bytes per pack to achieve the maximum training
throughput?
In TensorFlow 2.2, users need to do some simple experiments
to identify the optimum parameter.
NVIDIA provides official NCCL allreduce benchmarks,
and users can run these benchmarks on their multiple hosts
with their own networking environment.
And they need to vary the allreduce data size.
Typically, what users will see is that as
the allreduce data size increases,
the NCCL allreduce throughput
also increases with it.
But beyond a certain allreduce data size,
the NCCL throughput will start to plateau,
which means that NCCL has reached the limits
of your underlying network.
And that data size is exactly the optimum pack size,
which is sufficiently large to saturate your underlying
network.
If you set the pack size smaller,
each gradient pack cannot fully utilize
the network,
so you are wasting your network bandwidth.
If you set the gradient pack larger,
then TensorFlow needs
to wait longer
for the first pack of gradients,
which means you have less overlap.
So we know that the optimum pack size is
the allreduce data size that reaches the throughput plateau.
But this still seems like some work
that is required from the user.
So in a future release, we are also
working on autotuning this pack size
so users can get the optimum performance out of the box.
So far, we have seen these three optimizations,
and they improve the BERT SQuAD training throughput by 2.5X.
These optimizations are especially useful in the public cloud,
where the networking between the Cloud VMs is limited.
We are also working on more improvements, as I mentioned
throughout the talk.
We are going to support Keras Compile
and Fit, and autotune the NCCL parameters
and the pack size,
all coming in future releases of TensorFlow.
So stay tuned for these and more improvements.
[MUSIC PLAYING]