PRIYA GUPTA: Let's begin with the obvious question. Why should one care about distributed training? Training complex neural networks with large amounts of data can often take a long time. In the graph here, you can see that training the ResNet-50 model on a single but powerful GPU can take up to four days. If you have some experience running complex machine learning models, this may sound rather familiar to you. Bringing down your training time from days to hours can have a significant effect on your productivity, because you can try out new ideas faster.

In this talk, we're going to talk about distributed training, that is, running training in parallel on multiple devices such as CPUs, GPUs, or TPUs to bring down your training time. With the techniques we'll talk about in this talk, you can bring your training time down from weeks or days to hours, with just a few lines of code change and some powerful hardware.

To achieve these goals, we're pleased to introduce the new distribution strategy API. This is an easy way to distribute your TensorFlow training with very little modification to your code. With the distribution strategy API, you no longer need to place ops or parameters on specific devices, and you don't need to restructure your model in a way that the losses and gradients get aggregated correctly across the devices. Distribution strategy takes care of all of that for you.

So let's go over the key goals of distribution strategy. The first one is ease of use: we want you to make minimal code changes in order to distribute your training. The second is to give great performance out of the box: ideally, the user shouldn't have to change or configure any settings to get the most performance out of their hardware. And third, we want distribution strategy to work in a variety of different situations. So whether you want to scale your training on different hardware like GPUs or TPUs, or you want to use different APIs like Keras or estimator, or you want to run different distribution architectures like synchronous or asynchronous training, we want distribution strategy to be useful for you in all these situations.

So if you're just beginning with machine learning, you might start your training with a multi-core CPU on your desktop. TensorFlow takes care of scaling onto a multi-core CPU automatically. Next, you may add a GPU to your desktop to scale up your training. As long as you build your program with the right CUDA libraries, TensorFlow will automatically run your training on the GPU and give you a nice performance boost. But what if you have multiple GPUs on your machine, and you want to use all of them for your training? This is where distribution strategy comes in. In the next section, we're going to talk about how you can use distribution strategy to scale your training to multiple GPUs.

First, we'll look at some code to train the ResNet-50 model without any distribution. We'll use the Keras API, which is the recommended TensorFlow high-level API. We begin by creating some datasets for training and validation using the tf.data API. For the model, we'll simply reuse the ResNet-50 that's prepackaged with Keras in TensorFlow. Then we create an optimizer that we'll be using in our training. Once we have these pieces, we can compile the model, providing the loss and optimizer and maybe a few other things like metrics, which I've omitted in the slide here.
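A minimal sketch of that non-distributed setup, assuming the TF 1.x-era Keras API; the synthetic input data here is just a hypothetical stand-in for the real tf.data pipelines built from ImageNet:

```python
import numpy as np
import tensorflow as tf

# Tiny synthetic stand-ins for the real tf.data input pipelines.
images = np.random.rand(16, 224, 224, 3).astype("float32")
labels = np.random.randint(0, 1000, size=(16,))
train_dataset = tf.data.Dataset.from_tensor_slices((images, labels)).batch(8).repeat()
eval_dataset = tf.data.Dataset.from_tensor_slices((images, labels)).batch(8)

# Reuse the ResNet-50 that is prepackaged with Keras in TensorFlow.
model = tf.keras.applications.ResNet50(weights=None, classes=1000)

# The optimizer we'll use for training.
optimizer = tf.train.GradientDescentOptimizer(0.1)

# Compile the model, providing the loss, the optimizer, and some metrics.
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=["accuracy"])
```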
Once the model is compiled, you can begin your training by calling model.fit, providing the training dataset that you created earlier, along with how many epochs you want to run the training for. Fit will train your model and update the model's variables. Then you can call evaluate with the validation dataset to see how well your training did.

So given this code to run your training on a single machine or a single GPU, let's see how we can use distribution strategy to run it on multiple GPUs. It's actually very simple. You need to make only two changes: first, create an instance of something called mirrored strategy, and second, pass the strategy instance to the compile call with the distribute argument. That's it. That's all the code changes you need to run this code on multiple GPUs using distribution strategy.

Mirrored strategy is a type of the distribution strategy API that we introduced earlier. This API is available in the TensorFlow 1.11 release, which will be out very shortly. And at the bottom of the slide, we've linked to a complete example of training [INAUDIBLE] with Keras and multiple GPUs that you can try out.

With mirrored strategy, you don't need to make any changes to your model code or your training loop, which makes it very easy to use. This is because we've changed many underlying components of TensorFlow to be distribution aware: the optimizer, batch norm layers, metrics, and summaries are all now distribution aware. You don't need to make any changes to your input pipeline either, as long as you're using the recommended tf.data APIs. And finally, saving and checkpointing work seamlessly as well, so you can save with no distribution strategy or with one distribution strategy and restore with another seamlessly.
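A sketch of those two changes, using the contrib-era API the talk refers to (TF 1.11); `model`, `train_dataset`, and `eval_dataset` are the objects from the previous sketch, and the step counts are placeholders:

```python
# 1. Create an instance of mirrored strategy.
strategy = tf.contrib.distribute.MirroredStrategy()

# 2. Pass the strategy to compile via the distribute argument.
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=tf.train.GradientDescentOptimizer(0.1),
              metrics=["accuracy"],
              distribute=strategy)

# Training and evaluation are unchanged.
model.fit(train_dataset, epochs=2, steps_per_epoch=2)
model.evaluate(eval_dataset, steps=2)
```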
Now that you've seen some code on how to use mirrored strategy to scale to multiple GPUs, let's look under the hood a little bit and see what mirrored strategy does. In a nutshell, mirrored strategy implements a data parallelism architecture. It mirrors the variables on each device, e.g. each GPU, hence the name mirrored strategy, and it uses AllReduce to keep these variables in sync. And using these techniques, it implements synchronous training.

So that's a lot of terminology. Let's unpack each of these a bit. What is data parallelism? Let's say you have N workers or N devices. In data parallelism, each device runs the same model and computation, but on a different subset of the input data. Each device computes the loss and gradients based on the training samples that it sees, and then we combine these gradients and update the model's parameters. The updated model is then used in the next round of computation.

As I mentioned before, mirrored strategy mirrors the variables across the different devices. So let's say you have a variable A in your model. It'll be replicated as A0, A1, A2, and A3 across the four different devices, and together these four variables conceptually form a single variable called a mirrored variable. These variables are kept in sync by applying identical updates. A class of algorithms called AllReduce can be used to keep variables in sync by applying identical gradient updates. AllReduce algorithms aggregate the gradients across the different devices, for example by adding them up, and make them available on each device. It's a fused algorithm that can be very efficient and reduce the overhead of synchronization by quite a bit.

There are many versions of AllReduce algorithms available, based on the communication available between the different devices. One common algorithm is what is known as ring all-reduce. In ring all-reduce, each device sends a chunk of its gradients to its successor on the ring and receives another chunk from its predecessor. There are a few more such rounds of exchanges, and at the end of these exchanges, each device has received a combined copy of all the gradients. Ring all-reduce also uses network bandwidth optimally, because it ensures that both the upload and download bandwidth at each host is fully utilized. We have a team working on fast implementations of AllReduce for various network topologies. Some hardware vendors such as NVIDIA provide specialized implementations of all-reduce for their hardware, for example NVIDIA [INAUDIBLE]. The bottom line is that AllReduce can be fast when you have multiple devices on a single machine or a small number of machines with strong connectivity.

Putting all these pieces together, mirrored strategy uses mirrored variables and AllReduce to implement synchronous training. So let's see how that works. Let's say you have two devices, device 0 and device 1, and your model has two layers, A and B. Each layer has a single variable, and as you can see, the variables are replicated across the two devices. Each device receives one subset of the input data and computes the forward pass using its local copy of the variables. It then computes the backward pass and computes the gradients. Once the gradients are computed on each device, the devices communicate with each other using AllReduce to aggregate the gradients. And once the gradients are aggregated, each device updates its local copy of the variables. So in this way, the devices are always kept in sync. The next forward pass doesn't begin until each device has received a copy of the combined gradients and updated its variables.

AllReduce can further optimize things and bring down your training time by overlapping the computation of gradients at the lower layers of the network with the transmission of gradients at the higher layers. So in this case, you can compute the gradients of layer A while you're transmitting the gradients for layer B, and this can further reduce your training time.

So now that we've seen how mirrored strategy looks under the hood, let's look at what type of performance and scaling you can expect when using mirrored strategy with multiple GPUs. We used a ResNet-50 model with the ImageNet dataset for our benchmarking; it's a very popular benchmark for performance measurement. We used NVIDIA Tesla V100 GPUs on Google Cloud, and we used a batch size of 128 per GPU. On the x-axis here, you can see the number of GPUs, and on the y-axis, you can see the images per second processed during training. As you can see, as we increase the number of GPUs from one to two to four to eight, the images per second processed is close to doubling every time. In fact, we're able to achieve 90% to 95% scaling out of the box. Note that these numbers were obtained using the ResNet-50 model that's available in our official model garden repo, and currently it uses the estimator API. We're working on Keras performance actively.
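To make the ring all-reduce exchange described above concrete, here is a small NumPy simulation of the chunked scatter-reduce and all-gather rounds. This is only an illustration of the idea, not TensorFlow's implementation:

```python
import numpy as np

def ring_allreduce(grads):
    """Simulate ring all-reduce: every device ends up with the sum of all
    devices' gradients, exchanging one chunk per neighbor per round."""
    n = len(grads)
    # Each device splits its gradient vector into n chunks.
    chunks = [np.array_split(g.astype(float), n) for g in grads]

    # Scatter-reduce: after n-1 rounds, device i holds the fully reduced
    # chunk (i + 1) % n.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, chunks[i][(i - step) % n].copy())
                 for i in range(n)]
        for src, idx, data in sends:
            dst = (src + 1) % n
            chunks[dst][idx] = chunks[dst][idx] + data

    # All-gather: pass the reduced chunks around the ring until every
    # device has a complete copy of the combined gradients.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, chunks[i][(i + 1 - step) % n].copy())
                 for i in range(n)]
        for src, idx, data in sends:
            dst = (src + 1) % n
            chunks[dst][idx] = data

    return [np.concatenate(c) for c in chunks]

# Four simulated devices, each with its own gradient vector.
grads = [np.random.randn(8) for _ in range(4)]
reduced = ring_allreduce(grads)
assert all(np.allclose(r, sum(grads)) for r in reduced)
```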
So far, we've talked a lot about scaling onto multiple GPUs. What about Cloud TPUs? TPU stands for tensor processing unit. These are custom ASICs, designed and built by Google especially for accelerating machine learning workloads.

In the picture here, you can see the various generations of TPUs. On the top left, you can see TPU v1. In the middle, you can see Cloud TPU v2, which is now generally available on Google Cloud. And on the right side, you can see TPU v3, which was just announced at Google I/O a few months ago and is now available in alpha. And at the bottom of the slide, you can see a TPU pod, which is a number of Cloud TPUs interconnected with each other using a custom network. TPU pods are also now available in alpha. If you want to learn more about TPUs, please attend Frank's talk tomorrow on Cloud TPUs. In this talk, we're just going to focus on how you can use distribution strategy to scale your training on TPUs.

It's actually very similar to what we just saw with mirrored strategy, but instead we'll use TPU strategy this time. So first you create an instance of a TPU cluster resolver and give it the name of your Cloud TPU resource. Then you pass the cluster resolver to the TPU strategy constructor, along with another argument called steps_per_run, which I'll come back to in a bit. That's it. Once you have the strategy instance, you can pass it to your compile call as before, and your training will now run on Cloud TPUs. So as you can see, distribution strategy makes it really easy to switch between different types of hardware. This API will be available in the next TensorFlow release, which is 1.12. And at the bottom of the slide, we've provided a link to training ResNet-50 with the estimator API using TPU strategy.

So let's talk a little bit about what TPU strategy does. TPU strategy implements the same architecture as mirrored strategy, that is, data parallelism with synchronous training. There are eight cores on a single Cloud TPU, and these cores are connected via fast interconnects, which means that you can do AllReduce really fast to aggregate the gradients.

Coming back to the steps_per_run parameter from the previous slide: for most models, the computation time of a single step is small compared to the communication overheads, so it makes sense to run multiple steps at a time to amortize these overheads. Setting this number to a high value like 100 will give you the best performance out of the TPUs. The TPU team is working on reducing these overheads, so in the future you may not need to specify this argument anymore.

And finally, you can also use TPU strategy to scale to Cloud TPU pods, which are, as I mentioned, in alpha release right now. TPU pods consist of many Cloud TPUs interconnected via a fast network, and this means that AllReduce across all the TPUs in a pod can be really fast as well. So that's all about Cloud TPUs. I'll hand it off to my colleague Magnus to talk about scaling onto multiple nodes with GPUs.
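A rough sketch of the TPU setup just described, following the contrib-era names from the talk; the TPU name is hypothetical, and at the time of the talk this path was still rolling out (the linked example uses the estimator API):

```python
import tensorflow as tf

# Point a cluster resolver at your Cloud TPU resource (name is hypothetical).
resolver = tf.contrib.cluster_resolver.TPUClusterResolver(tpu="my-tpu")

# TPU strategy, with steps_per_run set high to amortize per-step overheads.
strategy = tf.contrib.distribute.TPUStrategy(resolver, steps_per_run=100)

# Pass the strategy to the compile call as before.
model = tf.keras.applications.ResNet50(weights=None, classes=1000)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=tf.train.GradientDescentOptimizer(0.1),
              metrics=["accuracy"],
              distribute=strategy)
```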
MAGNUS HYTTSTEN: Thank you. So that was how we scale on multiple GPU cards on a single node. What about multiple nodes, where we have multiple computers? Because the fact is that even though you can cram a lot of GPU cards into a single computer, sooner or later, if you do massive amounts of training, you will need to consider an architecture where you can scale out to multiple nodes as well. So this is an example where we see four worker nodes with four GPU cards in each of them. In terms of multi-node support, we currently have support for premade estimators in [INAUDIBLE] 1.11, which will be released shortly.

And we are working very, very hard with some awesome developers to get this support into Keras as well, so you should be aware that Keras support will be there as soon as possible. However, if you do want to use Keras with a multi-node distribution strategy today, you can actually achieve that using a little trick available in Keras: a function called model_to_estimator-- tf.keras.estimator.model_to_estimator-- that takes a Keras model as an argument and returns an estimator that you can then use for multi-node training.

So how do we set up a multi-node training environment in the first place? This was a really, really difficult problem up until the technology that's now open source called Kubernetes was released. Even though you can set up multi-node training with TensorFlow without running Kubernetes, it certainly helps to use Kubernetes as the orchestration platform to fire up multiple nodes. And Kubernetes is available in most clouds: GCP, and I think AWS and others as well.

So how does that work? Well, a Kubernetes cluster contains a set of nodes. In this particular picture, you can see three nodes, and each of them is a worker node. What TensorFlow requires in order for this to work is that each of these nodes has an environment variable called TF_CONFIG defined. So every single node in your cluster needs to have this variable defined. And in this TF_CONFIG, you have two parts. First of all, the cluster part, which defines all of the hosts that participate in the distributed training-- all the nodes in your cluster. And the second one is really to specify who am I-- what is my identity within this cluster? So you can see the task index here is 0, so this worker is host 1 at port 1. If it's 1, that's host 2 and its port, and if it's 2, it's host 3 at that port. So that's how you need to configure your cluster in order to do this.

Now, it would be really cumbersome to go around to all of the nodes and provide this specific configuration by hand, and Kubernetes provides an excellent way of doing that through its deployment configuration, the YAML file, so you can actually distribute the configuration-- the environment variables to set-- to the respective nodes. So how do we integrate that with TensorFlow? Well, it's part of the initial support, and this is just one way of doing it. There are multiple ways, but this is one way that we've tested. You can use a template engine called Jinja. You create a Jinja file, and there is actually such a file available in the tensorflow/ecosystem repository-- observe, not the TensorFlow repository; this is the ecosystem one. There is a directory under that repository called distribution_strategy that contains useful functions to use with distribution strategies. So you can use this file as a template to automatically generate the deployment.yaml for the Kubernetes cluster.

So what would that look like for a configuration like this, where we have three nodes? Well, it's really, really simple. The only thing you need to do in this file-- the Jinja file-- is the highlighted configuration up here: you set the worker replicas to three. The rest is just boilerplate that you keep for all of the setups you do. Make sense? This is actually a macro that populates TF_CONFIG based on this parameter up here.
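As a concrete example, here is roughly what that generated TF_CONFIG would contain on the first worker of a three-node cluster; the hostnames and port are hypothetical:

```python
import json
import os

# The cluster part lists every host taking part in training; the task part
# says which of those entries this particular process is.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": ["host1:5000", "host2:5000", "host3:5000"]
    },
    "task": {"type": "worker", "index": 0}  # this process is host1:5000
})
```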
So that's very simple, but what about the code? We've now configured the Kubernetes cluster to be able to do this distributed training with TensorFlow, but there is also some stuff we need to do with the code, just as we had for the single node. It's approximately the same as for the single-node, multi-GPU configuration. This is the estimator lingo: I provide a config here-- you see the RunConfig? It's just a standard estimator construct. And I set the train_distribute parameter to the collective AllReduce strategy-- so not mirrored strategy for the multi-node configuration; it's collective AllReduce strategy. And then I specify the number of GPUs I have available on each of the workers in my cluster. And that's it. Given that I have that config object, I can just pass it as the config parameter when I do the conversion from Keras over to an estimator, and I now have multi-node, with multiple GPUs in each of the nodes, configured for TensorFlow.

So let's look at this collective AllReduce strategy, because that's something different from what we talked about previously with mirrored strategy. What is that thing? Well, it is specifically designed for multiple worker nodes. It's essentially based on mirrored strategy, but it adds functionality to deal with multiple hosts, or multiple workers, in my cluster. And the good thing about it is that it automatically selects the best algorithm for doing the AllReduce across this cluster.

So what does that mean? What kind of algorithms do we have for doing AllReduce in a multi-node configuration? Well, one of them is very similar to what we have for a single node, which is ring all-reduce. In that case, the ring just extends across the nodes, and the GPUs perform one overall ring all-reduce across multiple hosts and GPUs. So essentially the same as for a single node, it's just that they are traversing hosts, with all of the penalties associated with doing that, of course, depending on the interconnect between these hosts. Another algorithm is hierarchical all-reduce-- I think that's a really complicated English word. What happens here is that we essentially pass all of the gradients up to a single GPU card on each of the respective hosts-- the slide is missing an arrow or two over here, never mind that; they're all supposed to send this stuff to GPU 0 and GPU 1. Then we do an AllReduce across the nodes, and the GPUs performing that operation then propagate the result back to the individual GPUs within their own node.

So depending on the network and other characteristics of your setup and hardware, one of these solutions will work very well, and the thing with collective AllReduce strategy is that it will automatically detect the best algorithm to use in your distributed cluster. So that was multi-node, with multiple accelerator cards within the nodes.
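Here is a rough sketch of that multi-worker estimator setup, using the contrib-era names the talk refers to; two GPUs per worker and the ResNet-50 model are just example choices:

```python
import tensorflow as tf

# Collective AllReduce strategy for multi-node, multi-GPU training.
strategy = tf.contrib.distribute.CollectiveAllReduceStrategy(
    num_gpus_per_worker=2)

# A standard estimator RunConfig, with the strategy as train_distribute.
config = tf.estimator.RunConfig(train_distribute=strategy)

# Convert the compiled Keras model to an estimator, passing in the config.
model = tf.keras.applications.ResNet50(weights=None, classes=1000)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=tf.train.GradientDescentOptimizer(0.1))
estimator = tf.keras.estimator.model_to_estimator(
    keras_model=model, config=config)
```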
There are also other ways to scale to multiple nodes with TensorFlow. One of them-- how many of you are familiar with parameter server strategy? Parameter servers? This is the classical way of doing TensorFlow distributed training. But you should not continue to do it the classical way-- once we roll out distribution strategies, that's the way to go.

So what I'm describing here is essentially the parameter server strategy, but instead of describing it in the old, classical way of doing TensorFlow, I'm going to describe how to do it with distribution strategies. Does that make sense? Yeah. If you didn't understand that and you haven't used the classical way, just don't worry about it. Just listen to what I have to say here.

To recap what the parameter server strategy is: it's essentially a strategy where we have shared storage, and we have a number of worker nodes working on batches from that shared storage. They're working completely independently-- well, not completely, as we'll see shortly-- but they are working independently, calculating gradients based on batches. And then we have a number of parameter servers. When the workers are finished with a batch, they send their gradients up to the parameter servers. The parameter servers have the updates from the other workers as well, so they calculate the average of the gradients and then pass those variables down to the workers. So it's not synchronous: these workers will get updates on the variables in an asynchronous fashion, which has good sides and bad sides. The good side is that one worker can go down, and the other workers can still execute as normal. That's the way this works.

So how can we set this up in a distribution strategy cluster? Well, it's really easy. Instead of just specifying the worker replicas in the Jinja file, we also specify ps_replicas-- the number of parameter servers that we have in our Kubernetes cluster. So that is the Kubernetes setup.

Now, what about the code? That's also really easy. You saw the RunConfig-- the config parameter-- previously. Instead of using the collective AllReduce strategy-- I got that right this time, collective AllReduce strategy-- you use the parameter server strategy. See that? It's just another type there. You still specify the number of GPUs per worker, you pass the config object to the Keras model_to_estimator function call, and you're all done. So very, very few lines of code need changing, even though we're talking about a massively different way of doing distributed TensorFlow training.
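Concretely, the only real change from the collective AllReduce sketch above is the strategy class; again a sketch under the same contrib-era assumptions:

```python
import tensorflow as tf

# Parameter server strategy instead of collective AllReduce.
strategy = tf.contrib.distribute.ParameterServerStrategy(
    num_gpus_per_worker=2)

config = tf.estimator.RunConfig(train_distribute=strategy)

# Same conversion from the compiled Keras model to an estimator as before.
model = tf.keras.applications.ResNet50(weights=None, classes=1000)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=tf.train.GradientDescentOptimizer(0.1))
estimator = tf.keras.estimator.model_to_estimator(
    keras_model=model, config=config)
```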
There is one more configuration that we are working on. I think we will have a release of this in 1.11, at least so you can try it out. It's a really, really cool setup where you actually run distributed training from your laptop. In this particular case, you have all of your model training code here. So forget about parameter servers-- now we're back to multiple workers and AllReduce. The only thing you fire up on these workers is tf_std_server.py, or whatever variant of that you want to use, because this code is also available in the TensorFlow ecosystem repository. So you can go check out how we did it for this normal setup, and you can change it in whatever way you want. The thing is that this script installed on the workers doesn't contain the model program at all. So when we fire up the model training from our laptop or workstation here, it will distribute that model over to those workers. If you have any changes to your model code, you can just make them locally, and they will automatically be distributed out to all of the workers.

Now, you may say, oh, that's a hassle, because now I've got to install this script on all the workers. But you do not have to do that, because the only thing you do is specify the script parameter in the Jinja file that you've seen a couple of times now-- and we have the same number of workers here-- and that means that the script will actually start on all of these nodes. So what we're talking about here is the capability to fire up a Kubernetes cluster with an arbitrary number of nodes, and without installing any code on them, you can use a local laptop, and it will automatically distribute the model and the training to all of these worker nodes, just by having these two lines here.

What about the code? So again, we have the RunConfig here. This time, we're going to set a parameter called experimental_distribute to a distribute config. As part of the distribute config, we embed a collective AllReduce strategy with, as we saw before, the number of GPUs we have per worker. But the distribute config requires one more parameter, and that is the remote cluster. The master node here needs to know the cluster to which it should send all the model code-- those servers are waiting there for the model code to be shared. Make sense? So you've got to specify that parameter. Then you finish up your config object and pass it to model_to_estimator as the config object. And as you've seen before, it's just a couple of lines of difference between these different configurations.
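A rough sketch of that standalone-client configuration, using the experimental, contrib-era names mentioned in the talk; the worker hostnames are hypothetical:

```python
import tensorflow as tf

# Strategy for the remote workers, as before.
strategy = tf.contrib.distribute.CollectiveAllReduceStrategy(
    num_gpus_per_worker=2)

# The distribute config also names the remote cluster that the model code
# should be sent to (hostnames are hypothetical).
distribute_config = tf.contrib.distribute.DistributeConfig(
    train_distribute=strategy,
    remote_cluster={"worker": ["host1:5000", "host2:5000", "host3:5000"]})

config = tf.estimator.RunConfig(experimental_distribute=distribute_config)

# Convert the compiled Keras model and train as usual; the actual training
# runs on the remote workers.
model = tf.keras.applications.ResNet50(weights=None, classes=1000)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=tf.train.GradientDescentOptimizer(0.1))
estimator = tf.keras.estimator.model_to_estimator(
    keras_model=model, config=config)
```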
That's really it for TensorFlow multi-node training. So let's summarize what we've talked about here today. First of all, we went through the single-node distribution strategy setup: we talked about the mirrored strategy for multiple GPUs within a single node, and we talked about the TPU strategy to distribute work to the TPUs. We also went through the AllReduce algorithm, which is used by distribution strategy to be able to do this single-node distribution. Then we talked about multi-node distribution: we talked about using Kubernetes to distribute TensorFlow training, using these Jinja files that translate into the YAML file for deployment; we talked about AllReduce across nodes using the collective AllReduce strategy; we talked about the parameter server setup with distribution strategy; and then finally we talked about distributed training from a standalone client, distributing the model code over to the workers-- work in progress.

So most of the stuff that we talked about today you'll find under tf.contrib.distribute-- distribution strategy-- in the TensorFlow repository. As part of 1.11, many of the things that we talked about you will be able to start to use. If you really want to try things out, you can also check out the nightly builds and see how far we've gone, but 1.11 should be out shortly. We are still working on performance. This is always going to be something that we work on, to match the state of the art. As you saw with single-node multi-GPU, we've achieved 90% to 95% scaling performance on the GPU cards, and we're continuously working on improving this for all of the different configurations we've talked about. TPU strategy will be available as part of 1.12 with Keras; right now it's only available with estimator. But remember, we have the estimator trick-- model_to_estimator within Keras-- so you can actually take a Keras model, convert it to an estimator, and still use TPU strategy as part of 1.11. Multi-worker GPU support in Keras is something we're also working on, as I said.

So that means that in native Keras code, we will be able to specify multi-worker and GPU support, and also Eager execution. How many of you are familiar with Eager execution? You've got to check that out-- that's a really important feature of TensorFlow. So if you're not using Eager, you should definitely stop using anything else and start using Eager. The entire getting-started experience of TensorFlow is based on Eager mode, and we will have great performance bridges between Eager execution and graph-mode execution and all of this distribution. The entire architecture builds on this, so you should check it out. Eager execution support is also something we're working on, so that you can directly, in Eager execution mode, utilize multiple GPU cards and multiple nodes in the same way that we discussed in this setup. And then, when we have multiple workers with GPUs-- obviously, if one fails, and we're talking about AllReduce with synchronous gradient updates, we do have to have a discussion of fault tolerance. So that's something we're looking into building into this, so we have more resilience with respect to faults.

So another summary: what did we talk about today? We talked about the distribution strategy API, which is very easy to use. It's the new way of doing distributed TensorFlow training-- forget about anything that you did before, and start to learn how to do this. We talked about distribution strategies having great performance right out of the box; we saw the scaling between 1 and 8 GPUs on a single node. And then we looked at how it can scale across different accelerators, GPUs as well as TPUs, single node as well as multi-node and TPU pods. And that's it. You should definitely take a picture of this slide, because this slide summarizes all of the resources that we had. And with that, we are done. Thank you very much for listening to this. If you have any questions--