PRIYA GUPTA: Let's begin with the obvious question. Why should one care about distributed training? Training complex neural networks with large amounts of data can often take a long time. In the graph here, you can see that training the ResNet-50 model on a single but powerful GPU can take up to four days. If you have some experience running complex machine learning models, this may sound rather familiar to you. Bringing down your training time from days to hours can have a significant effect on your productivity, because you can try out new ideas faster.

In this talk, we're going to talk about distributed training, that is, running training in parallel on multiple devices such as CPUs, GPUs, or TPUs to bring down your training time. With the techniques we'll talk about in this talk, you can bring your training time down from weeks or days to hours, with just a few lines of code change and some powerful hardware.

To achieve these goals, we're pleased to introduce the new distribution strategy API. This is an easy way to distribute your TensorFlow training with very little modification to your code. With the distribution strategy API, you no longer need to place ops or parameters on specific devices, and you don't need to restructure your model in a way that the losses and gradients get aggregated correctly across the devices. Distribution strategy takes care of all of that for you.

So let's go over the key goals of distribution strategy. The first one is ease of use: we want you to make minimal code changes in order to distribute your training. The second is to give great performance out of the box: ideally, the user shouldn't have to change or configure any settings to get the most performance out of their hardware. And third, we want distribution strategy to work in a variety of different situations. So whether you want to scale your training on different hardware like GPUs or TPUs, or you want to use different APIs like Keras or estimator, or you want to run different distribution architectures like synchronous or asynchronous training, we want distribution strategy to be useful for you in all these situations.

So if you're just beginning with machine learning, you might start your training with a multi-core CPU on your desktop. TensorFlow takes care of scaling onto a multi-core CPU automatically. Next, you may add a GPU to your desktop to scale up your training. As long as you build your program with the right CUDA libraries, TensorFlow will automatically run your training on the GPU and give you a nice performance boost. But what if you have multiple GPUs on your machine, and you want to use all of them for your training? This is where distribution strategy comes in. In the next section, we're going to talk about how you can use distribution strategy to scale your training to multiple GPUs.

First, we'll look at some code to train the ResNet-50 model without any distribution. We'll use the Keras API, which is the recommended TensorFlow high-level API. We begin by creating some datasets for training and validation using the tf.data API. For the model, we'll simply reuse the ResNet-50 that's prepackaged with Keras in TensorFlow. Then we create an optimizer that we'll be using in our training. Once we have these pieces, we can compile the model, providing the loss and optimizer and maybe a few other things like metrics, which I've omitted in the slide here.
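A minimal sketch of that non-distributed setup, assuming the TF 1.x-era Keras API; the synthetic input data here is just a hypothetical stand-in for the real tf.data pipelines built from ImageNet:

```python
import numpy as np
import tensorflow as tf

# Tiny synthetic stand-ins for the real tf.data input pipelines.
images = np.random.rand(16, 224, 224, 3).astype("float32")
labels = np.random.randint(0, 1000, size=(16,))
train_dataset = tf.data.Dataset.from_tensor_slices((images, labels)).batch(8).repeat()
eval_dataset = tf.data.Dataset.from_tensor_slices((images, labels)).batch(8)

# Reuse the ResNet-50 that is prepackaged with Keras in TensorFlow.
model = tf.keras.applications.ResNet50(weights=None, classes=1000)

# The optimizer we'll use for training.
optimizer = tf.train.GradientDescentOptimizer(0.1)

# Compile the model, providing the loss, the optimizer, and some metrics.
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=["accuracy"])
```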
Once the model is compiled, you can begin your training by calling model.fit, providing the training dataset that you created earlier, along with how many epochs you want to run the training for. Fit will train your model and update the model's variables. Then you can call evaluate with the validation dataset to see how well your training did.

So given this code to run your training on a single machine or a single GPU, let's see how we can use distribution strategy to run it on multiple GPUs. It's actually very simple. You need to make only two changes: first, create an instance of something called mirrored strategy, and second, pass the strategy instance to the compile call with the distribute argument. That's it. That's all the code changes you need to run this code on multiple GPUs using distribution strategy.

Mirrored strategy is a type of the distribution strategy API that we introduced earlier. This API is available in the TensorFlow 1.11 release, which will be out very shortly. And at the bottom of the slide, we've linked to a complete example of training [INAUDIBLE] with Keras and multiple GPUs that you can try out.

With mirrored strategy, you don't need to make any changes to your model code or your training loop, which makes it very easy to use. This is because we've changed many underlying components of TensorFlow to be distribution aware: the optimizer, batch norm layers, metrics, and summaries are all now distribution aware. You don't need to make any changes to your input pipeline either, as long as you're using the recommended tf.data APIs. And finally, saving and checkpointing work seamlessly as well, so you can save with no distribution strategy or with one distribution strategy and restore with another seamlessly.
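A sketch of those two changes, using the contrib-era API the talk refers to (TF 1.11); `model`, `train_dataset`, and `eval_dataset` are the objects from the previous sketch, and the step counts are placeholders:

```python
# 1. Create an instance of mirrored strategy.
strategy = tf.contrib.distribute.MirroredStrategy()

# 2. Pass the strategy to compile via the distribute argument.
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=tf.train.GradientDescentOptimizer(0.1),
              metrics=["accuracy"],
              distribute=strategy)

# Training and evaluation are unchanged.
model.fit(train_dataset, epochs=2, steps_per_epoch=2)
model.evaluate(eval_dataset, steps=2)
```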
Now that you've seen some code on how to use mirrored strategy to scale to multiple GPUs, let's look under the hood a little bit and see what mirrored strategy does. In a nutshell, mirrored strategy implements a data parallelism architecture. It mirrors the variables on each device, e.g. each GPU, hence the name mirrored strategy, and it uses AllReduce to keep these variables in sync. And using these techniques, it implements synchronous training.

So that's a lot of terminology. Let's unpack each of these a bit. What is data parallelism? Let's say you have N workers or N devices. In data parallelism, each device runs the same model and computation, but on a different subset of the input data. Each device computes the loss and gradients based on the training samples that it sees, and then we combine these gradients and update the model's parameters. The updated model is then used in the next round of computation.

As I mentioned before, mirrored strategy mirrors the variables across the different devices. So let's say you have a variable A in your model. It'll be replicated as A0, A1, A2, and A3 across the four different devices, and together these four variables conceptually form a single variable called a mirrored variable. These variables are kept in sync by applying identical updates. A class of algorithms called AllReduce can be used to keep variables in sync by applying identical gradient updates. AllReduce algorithms aggregate the gradients across the different devices, for example by adding them up, and make them available on each device. It's a fused algorithm that can be very efficient and reduce the overhead of synchronization by quite a bit.

There are many versions of AllReduce algorithms available, based on the communication available between the different devices. One common algorithm is what is known as ring all-reduce. In ring all-reduce, each device sends a chunk of its gradients to its successor on the ring and receives another chunk from its predecessor. There are a few more such rounds of exchanges, and at the end of these exchanges, each device has received a combined copy of all the gradients. Ring all-reduce also uses network bandwidth optimally, because it ensures that both the upload and download bandwidth at each host is fully utilized. We have a team working on fast implementations of AllReduce for various network topologies. Some hardware vendors such as NVIDIA provide specialized implementations of all-reduce for their hardware, for example NVIDIA [INAUDIBLE]. The bottom line is that AllReduce can be fast when you have multiple devices on a single machine or a small number of machines with strong connectivity.

Putting all these pieces together, mirrored strategy uses mirrored variables and AllReduce to implement synchronous training. So let's see how that works. Let's say you have two devices, device 0 and device 1, and your model has two layers, A and B. Each layer has a single variable, and as you can see, the variables are replicated across the two devices. Each device receives one subset of the input data and computes the forward pass using its local copy of the variables. It then computes the backward pass and computes the gradients. Once the gradients are computed on each device, the devices communicate with each other using AllReduce to aggregate the gradients. And once the gradients are aggregated, each device updates its local copy of the variables. So in this way, the devices are always kept in sync. The next forward pass doesn't begin until each device has received a copy of the combined gradients and updated its variables.

AllReduce can further optimize things and bring down your training time by overlapping the computation of gradients at the lower layers of the network with the transmission of gradients at the higher layers. So in this case, you can compute the gradients of layer A while you're transmitting the gradients for layer B, and this can further reduce your training time.

So now that we've seen how mirrored strategy looks under the hood, let's look at what type of performance and scaling you can expect when using mirrored strategy with multiple GPUs. We used a ResNet-50 model with the ImageNet dataset for our benchmarking; it's a very popular benchmark for performance measurement. We used NVIDIA Tesla V100 GPUs on Google Cloud, and we used a batch size of 128 per GPU. On the x-axis here, you can see the number of GPUs, and on the y-axis, you can see the images per second processed during training. As you can see, as we increase the number of GPUs from one to two to four to eight, the images per second processed is close to doubling every time. In fact, we're able to achieve 90% to 95% scaling out of the box. Note that these numbers were obtained using the ResNet-50 model that's available in our official model garden repo, and currently it uses the estimator API. We're working on Keras performance actively.
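To make the ring all-reduce exchange described above concrete, here is a small NumPy simulation of the chunked scatter-reduce and all-gather rounds. This is only an illustration of the idea, not TensorFlow's implementation:

```python
import numpy as np

def ring_allreduce(grads):
    """Simulate ring all-reduce: every device ends up with the sum of all
    devices' gradients, exchanging one chunk per neighbor per round."""
    n = len(grads)
    # Each device splits its gradient vector into n chunks.
    chunks = [np.array_split(g.astype(float), n) for g in grads]

    # Scatter-reduce: after n-1 rounds, device i holds the fully reduced
    # chunk (i + 1) % n.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, chunks[i][(i - step) % n].copy())
                 for i in range(n)]
        for src, idx, data in sends:
            dst = (src + 1) % n
            chunks[dst][idx] = chunks[dst][idx] + data

    # All-gather: pass the reduced chunks around the ring until every
    # device has a complete copy of the combined gradients.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, chunks[i][(i + 1 - step) % n].copy())
                 for i in range(n)]
        for src, idx, data in sends:
            dst = (src + 1) % n
            chunks[dst][idx] = data

    return [np.concatenate(c) for c in chunks]

# Four simulated devices, each with its own gradient vector.
grads = [np.random.randn(8) for _ in range(4)]
reduced = ring_allreduce(grads)
assert all(np.allclose(r, sum(grads)) for r in reduced)
```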
So far, we've talked a lot about scaling onto multiple GPUs. What about Cloud TPUs? TPU stands for tensor processing unit. These are custom ASICs, designed and built by Google especially for accelerating machine learning workloads.

In the picture here, you can see the various generations of TPUs. On the top left, you can see TPU v1. In the middle, you can see Cloud TPU v2, which is now generally available on Google Cloud. And on the right side, you can see TPU v3, which was just announced at Google I/O a few months ago and is now available in alpha. And at the bottom of the slide, you can see a TPU pod, which is a number of Cloud TPUs interconnected with each other using a custom network. TPU pods are also now available in alpha. If you want to learn more about TPUs, please attend Frank's talk tomorrow on Cloud TPUs. In this talk, we're just going to focus on how you can use distribution strategy to scale your training on TPUs.

It's actually very similar to what we just saw with mirrored strategy, but instead we'll use TPU strategy this time. So first you create an instance of a TPU cluster resolver and give it the name of your Cloud TPU resource. Then you pass the cluster resolver to the TPU strategy constructor, along with another argument called steps_per_run, which I'll come back to in a bit. That's it. Once you have the strategy instance, you can pass it to your compile call as before, and your training will now run on Cloud TPUs. So as you can see, distribution strategy makes it really easy to switch between different types of hardware. This API will be available in the next TensorFlow release, which is 1.12. And at the bottom of the slide, we've provided a link to training ResNet-50 with the estimator API using TPU strategy.

So let's talk a little bit about what TPU strategy does. TPU strategy implements the same architecture as mirrored strategy, that is, data parallelism with synchronous training. There are eight cores on a single Cloud TPU, and these cores are connected via fast interconnects, which means that you can do AllReduce really fast to aggregate the gradients.

Coming back to the steps_per_run parameter from the previous slide: for most models, the computation time of a single step is small compared to the communication overheads, so it makes sense to run multiple steps at a time to amortize these overheads. Setting this number to a high value like 100 will give you the best performance out of the TPUs. The TPU team is working on reducing these overheads, so in the future you may not need to specify this argument anymore.

And finally, you can also use TPU strategy to scale to Cloud TPU pods, which are, as I mentioned, in alpha release right now. TPU pods consist of many Cloud TPUs interconnected via a fast network, and this means that AllReduce across all the TPUs in a pod can be really fast as well. So that's all about Cloud TPUs. I'll hand it off to my colleague Magnus to talk about scaling onto multiple nodes with GPUs.
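A rough sketch of the TPU setup just described, following the contrib-era names from the talk; the TPU name is hypothetical, and at the time of the talk this path was still rolling out (the linked example uses the estimator API):

```python
import tensorflow as tf

# Point a cluster resolver at your Cloud TPU resource (name is hypothetical).
resolver = tf.contrib.cluster_resolver.TPUClusterResolver(tpu="my-tpu")

# TPU strategy, with steps_per_run set high to amortize per-step overheads.
strategy = tf.contrib.distribute.TPUStrategy(resolver, steps_per_run=100)

# Pass the strategy to the compile call as before.
model = tf.keras.applications.ResNet50(weights=None, classes=1000)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=tf.train.GradientDescentOptimizer(0.1),
              metrics=["accuracy"],
              distribute=strategy)
```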
MAGNUS HYTTSTEN: Thank you. So that was how we scale on multiple GPU cards on a single node. What about multiple nodes, where we have multiple computers? Because the fact is that even though you can cram a lot of GPU cards into a single computer, sooner or later, if you do massive amounts of training, you will need to consider an architecture where you can scale out to multiple nodes as well. So this is an example where we see four worker nodes with four GPU cards in each of them. In terms of multi-node support, we currently have support for premade estimators in [INAUDIBLE] 1.11, which will be released shortly.

And we are working very, very hard with some awesome developers to get this support into Keras as well, so you should be aware that Keras support will be there as soon as possible. However, if you do want to use Keras with a multi-node distribution strategy today, you can actually achieve that using a little trick available in Keras: a function called model_to_estimator-- tf.keras.estimator.model_to_estimator-- that takes a Keras model as an argument and returns an estimator that you can then use for multi-node training.

So how do we set up a multi-node training environment in the first place? This was a really, really difficult problem up until the technology that's now open source called Kubernetes was released. Even though you can set up multi-node training with TensorFlow without running Kubernetes, it certainly helps to use Kubernetes as the orchestration platform to fire up multiple nodes. And Kubernetes is available in most clouds: GCP, and I think AWS and others as well.

So how does that work? Well, a Kubernetes cluster contains a set of nodes. In this particular picture, you can see three nodes, and each of them is a worker node. What TensorFlow requires in order for this to work is that each of these nodes has an environment variable called TF_CONFIG defined. So every single node in your cluster needs to have this variable defined. And in this TF_CONFIG, you have two parts. First of all, the cluster part, which defines all of the hosts that participate in the distributed training-- all the nodes in your cluster. And the second one is really to specify who am I-- what is my identity within this cluster? So you can see the task index here is 0, so this worker is host 1 at port 1. If it's 1, that's host 2 and its port, and if it's 2, it's host 3 at that port. So that's how you need to configure your cluster in order to do this.

Now, it would be really cumbersome to go around to all of the nodes and provide this specific configuration by hand, and Kubernetes provides an excellent way of doing that through its deployment configuration, the YAML file, so you can actually distribute the configuration-- the environment variables to set-- to the respective nodes. So how do we integrate that with TensorFlow? Well, it's part of the initial support, and this is just one way of doing it. There are multiple ways, but this is one way that we've tested. You can use a template engine called Jinja. You create a Jinja file, and there is actually such a file available in the tensorflow/ecosystem repository-- observe, not the TensorFlow repository; this is the ecosystem one. There is a directory under that repository called distribution_strategy that contains useful functions to use with distribution strategies. So you can use this file as a template to automatically generate the deployment.yaml for the Kubernetes cluster.

So what would that look like for a configuration like this, where we have three nodes? Well, it's really, really simple. The only thing you need to do in this file-- the Jinja file-- is the highlighted configuration up here: you set the worker replicas to three. The rest is just boilerplate that you keep for all of the setups you do. Make sense? This is actually a macro that populates TF_CONFIG based on this parameter up here.
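As a concrete example, here is roughly what that generated TF_CONFIG would contain on the first worker of a three-node cluster; the hostnames and port are hypothetical:

```python
import json
import os

# The cluster part lists every host taking part in training; the task part
# says which of those entries this particular process is.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": ["host1:5000", "host2:5000", "host3:5000"]
    },
    "task": {"type": "worker", "index": 0}  # this process is host1:5000
})
```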
So that's very simple, but what about the code? We've now configured the Kubernetes cluster to be able to do this distributed training with TensorFlow, but there is also some stuff we need to do with the code, just as we had for the single node. It's approximately the same as for the single-node, multi-GPU configuration. This is the estimator lingo: I provide a config here-- you see the RunConfig? It's just a standard estimator construct. And I set the train_distribute parameter to the collective AllReduce strategy-- so not mirrored strategy for the multi-node configuration; it's collective AllReduce strategy. And then I specify the number of GPUs I have available on each of the workers in my cluster. And that's it. Given that I have that config object, I can just pass it as the config parameter when I do the conversion from Keras over to an estimator, and I now have multi-node, with multiple GPUs in each of the nodes, configured for TensorFlow.

So let's look at this collective AllReduce strategy, because that's something different from what we talked about previously with mirrored strategy. What is that thing? Well, it is specifically designed for multiple worker nodes. It's essentially based on mirrored strategy, but it adds functionality to deal with multiple hosts, or multiple workers, in my cluster. And the good thing about it is that it automatically selects the best algorithm for doing the AllReduce across this cluster.

So what does that mean? What kind of algorithms do we have for doing AllReduce in a multi-node configuration? Well, one of them is very similar to what we have for a single node, which is ring all-reduce. In that case, the ring just extends across the nodes, and the GPUs perform one overall ring all-reduce across multiple hosts and GPUs. So essentially the same as for a single node, it's just that they are traversing hosts, with all of the penalties associated with doing that, of course, depending on the interconnect between these hosts. Another algorithm is hierarchical all-reduce-- I think that's a really complicated English word. What happens here is that we essentially pass all of the gradients up to a single GPU card on each of the respective hosts-- the slide is missing an arrow or two over here, never mind that; they're all supposed to send this stuff to GPU 0 and GPU 1. Then we do an AllReduce across the nodes, and the GPUs performing that operation then propagate the result back to the individual GPUs within their own node.

So depending on the network and other characteristics of your setup and hardware, one of these solutions will work very well, and the thing with collective AllReduce strategy is that it will automatically detect the best algorithm to use in your distributed cluster. So that was multi-node, with multiple accelerator cards within the nodes.
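Here is a rough sketch of that multi-worker estimator setup, using the contrib-era names the talk refers to; two GPUs per worker and the ResNet-50 model are just example choices:

```python
import tensorflow as tf

# Collective AllReduce strategy for multi-node, multi-GPU training.
strategy = tf.contrib.distribute.CollectiveAllReduceStrategy(
    num_gpus_per_worker=2)

# A standard estimator RunConfig, with the strategy as train_distribute.
config = tf.estimator.RunConfig(train_distribute=strategy)

# Convert the compiled Keras model to an estimator, passing in the config.
model = tf.keras.applications.ResNet50(weights=None, classes=1000)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=tf.train.GradientDescentOptimizer(0.1))
estimator = tf.keras.estimator.model_to_estimator(
    keras_model=model, config=config)
```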
There are also other ways to scale to multiple nodes with TensorFlow. One of them-- how many of you are familiar with parameter server strategy? Parameter servers? This is the classical way of doing TensorFlow distributed training. But you should not continue to do it the classical way-- once we roll out distribution strategies, that's the way to go.

So what I'm describing here is essentially the parameter server strategy, but instead of describing it in the old, classical way of doing TensorFlow, I'm going to describe how to do it with distribution strategies. Does that make sense? Yeah. If you didn't understand that and you haven't used the classical way, just don't worry about it. Just listen to what I have to say here.

To recap what the parameter server strategy is: it's essentially a strategy where we have shared storage, and we have a number of worker nodes working on batches from that shared storage. They're working completely independently-- well, not completely, as we'll see shortly-- but they are working independently, calculating gradients based on batches. And then we have a number of parameter servers. When the workers are finished with a batch, they send their gradients up to the parameter servers. The parameter servers have the updates from the other workers as well, so they calculate the average of the gradients and then pass those variables down to the workers. So it's not synchronous: these workers will get updates on the variables in an asynchronous fashion, which has good sides and bad sides. The good side is that one worker can go down, and the other workers can still execute as normal. That's the way this works.

So how can we set this up in a distribution strategy cluster? Well, it's really easy. Instead of just specifying the worker replicas in the Jinja file, we also specify ps_replicas-- the number of parameter servers that we have in our Kubernetes cluster. So that is the Kubernetes setup.

Now, what about the code? That's also really easy. You saw the RunConfig-- the config parameter-- previously. Instead of using the collective AllReduce strategy-- I got that right this time, collective AllReduce strategy-- you use the parameter server strategy. See that? It's just another type there. You still specify the number of GPUs per worker, you pass the config object to the Keras model_to_estimator function call, and you're all done. So very, very few lines of code need changing, even though we're talking about a massively different way of doing distributed TensorFlow training.
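Concretely, the only real change from the collective AllReduce sketch above is the strategy class; again a sketch under the same contrib-era assumptions:

```python
import tensorflow as tf

# Parameter server strategy instead of collective AllReduce.
strategy = tf.contrib.distribute.ParameterServerStrategy(
    num_gpus_per_worker=2)

config = tf.estimator.RunConfig(train_distribute=strategy)

# Same conversion from the compiled Keras model to an estimator as before.
model = tf.keras.applications.ResNet50(weights=None, classes=1000)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=tf.train.GradientDescentOptimizer(0.1))
estimator = tf.keras.estimator.model_to_estimator(
    keras_model=model, config=config)
```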
There is one more configuration that we are working on. I think we will have a release of this in 1.11, at least so you can try it out. It's a really, really cool setup where you actually run distributed training from your laptop. In this particular case, you have all of your model training code here. So forget about parameter servers-- now we're back to multiple workers and AllReduce. The only thing you fire up on these workers is tf_std_server.py, or whatever variant of that you want to use, because this code is also available in the TensorFlow ecosystem repository. So you can go check out how we did it for this normal setup, and you can change it in whatever way you want. The thing is that this script installed on the workers doesn't contain the model program at all. So when we fire up the model training from our laptop or workstation here, it will distribute that model over to those workers. If you have any changes to your model code, you can just make them locally, and they will automatically be distributed out to all of the workers.

Now, you may say, oh, that's a hassle, because now I've got to install this script on all the workers. But you do not have to do that, because the only thing you do is specify the script parameter in the Jinja file that you've seen a couple of times now-- and we have the same number of workers here-- and that means that the script will actually start on all of these nodes. So what we're talking about here is the capability to fire up a Kubernetes cluster with an arbitrary number of nodes, and without installing any code on them, you can use a local laptop, and it will automatically distribute the model and the training to all of these worker nodes, just by having these two lines here.

What about the code? So again, we have the RunConfig here. This time, we're going to set a parameter called experimental_distribute to a distribute config. As part of the distribute config, we embed a collective AllReduce strategy with, as we saw before, the number of GPUs we have per worker. But the distribute config requires one more parameter, and that is the remote cluster. The master node here needs to know the cluster to which it should send all the model code-- those servers are waiting there for the model code to be shared. Make sense? So you've got to specify that parameter. Then you finish up your config object and pass it to model_to_estimator as the config object. And as you've seen before, it's just a couple of lines of difference between these different configurations.
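A rough sketch of that standalone-client configuration, using the experimental, contrib-era names mentioned in the talk; the worker hostnames are hypothetical:

```python
import tensorflow as tf

# Strategy for the remote workers, as before.
strategy = tf.contrib.distribute.CollectiveAllReduceStrategy(
    num_gpus_per_worker=2)

# The distribute config also names the remote cluster that the model code
# should be sent to (hostnames are hypothetical).
distribute_config = tf.contrib.distribute.DistributeConfig(
    train_distribute=strategy,
    remote_cluster={"worker": ["host1:5000", "host2:5000", "host3:5000"]})

config = tf.estimator.RunConfig(experimental_distribute=distribute_config)

# Convert the compiled Keras model and train as usual; the actual training
# runs on the remote workers.
model = tf.keras.applications.ResNet50(weights=None, classes=1000)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=tf.train.GradientDescentOptimizer(0.1))
estimator = tf.keras.estimator.model_to_estimator(
    keras_model=model, config=config)
```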
That's really it for TensorFlow multi-node training. So let's summarize what we've talked about here today. First of all, we went through the single-node distribution strategy setup: we talked about the mirrored strategy for multiple GPUs within a single node, and we talked about the TPU strategy to distribute work to the TPUs. We also went through the AllReduce algorithm, which is used by distribution strategy to be able to do this single-node distribution. Then we talked about multi-node distribution: we talked about using Kubernetes to distribute TensorFlow training, using these Jinja files that translate into the YAML file for deployment; we talked about AllReduce across nodes using the collective AllReduce strategy; we talked about the parameter server setup with distribution strategy; and then finally we talked about distributed training from a standalone client, distributing the model code over to the workers-- work in progress.

So most of the stuff that we talked about today you'll find under tf.contrib.distribute-- distribution strategy-- in the TensorFlow repository. As part of 1.11, many of the things that we talked about you will be able to start to use. If you really want to try things out, you can also check out the nightly builds and see how far we've gone, but 1.11 should be out shortly. We are still working on performance. This is always going to be something that we work on, to match the state of the art. As you saw with single-node multi-GPU, we've achieved 90% to 95% scaling performance on the GPU cards, and we're continuously working on improving this for all of the different configurations we've talked about. TPU strategy will be available as part of 1.12 with Keras; right now it's only available with estimator. But remember, we have the estimator trick-- model_to_estimator within Keras-- so you can actually take a Keras model, convert it to an estimator, and still use TPU strategy as part of 1.11. Multi-worker GPU support in Keras is something we're also working on, as I said.

So that means that in native Keras code, we will be able to specify multi-worker and GPU support, and also Eager execution. How many of you are familiar with Eager execution? You've got to check that out-- that's a really important feature of TensorFlow. So if you're not using Eager, you should definitely stop using anything else and start using Eager. The entire getting-started experience of TensorFlow is based on Eager mode, and we will have great performance bridges between Eager execution and graph-mode execution and all of this distribution. The entire architecture builds on this, so you should check it out. Eager execution support is also something we're working on, so that you can directly, in Eager execution mode, utilize multiple GPU cards and multiple nodes in the same way that we discussed in this setup. And then, when we have multiple workers with GPUs-- obviously, if one fails, and we're talking about AllReduce with synchronous gradient updates, we do have to have a discussion of fault tolerance. So that's something we're looking into building into this, so we have more resilience with respect to faults.

So another summary: what did we talk about today? We talked about the distribution strategy API, which is very easy to use. It's the new way of doing distributed TensorFlow training-- forget about anything that you did before, and start to learn how to do this. We talked about distribution strategies having great performance right out of the box; we saw the scaling between 1 and 8 GPUs on a single node. And then we looked at how it can scale across different accelerators, GPUs as well as TPUs, single node as well as multi-node and TPU pods. And that's it. You should definitely take a picture of this slide, because this slide summarizes all of the resources that we had. And with that, we are done. Thank you very much for listening to this. If you have any questions--