TAYLOR ROBIE: I'm Taylor. I'm an engineer on the TF Keras team.

PRIYA GUPTA: And I'm Priya. I'm also an engineer on the TensorFlow team. I lead the distribution strategy team.

TAYLOR ROBIE: And today we're going to be talking about writing performant models in TensorFlow 2. Just to give a brief outline of the talk, we're first going to talk about the structure of a training loop in TensorFlow 2. And then we're going to talk about the different APIs that you can use to make your training more performant. The first API that we'll be talking about is tf.data. This is the API for writing high-performance pipelines to avoid various sorts of stalls and make sure that your training always has data as it's ready to consume it. Next we're going to be talking about tf.function. This is how you can formulate your training step in a way that is efficient so that the runtime can run it with very minimal overhead. And, finally, we're going to be talking about tf.distribute, which is how to target your training to specific hardware and also scale out to more hardware.

In order to demonstrate all of these APIs, we're going to be walking through a case study, starting with the most naive implementation and then progressively adding more performant APIs and looking at how that helps our training. So we're going to be training a cat classifier. Because we're an internet company, that's what we do. And we're going to be training it on MobileNet V2. This task was largely chosen arbitrarily, just as a representative task, so these lessons should be broadly applicable to your workflows as well. And there's a notebook; I'll put the link back up at the end of the slides, so you're encouraged to take a look afterward and compare it back to the talk.

So let's get started. A brief overview of training loops in TensorFlow 2. We're going to be writing a custom training loop. And the reason is that we want to look at the mechanics of what the system is actually doing and how we can make it performant. If you're used to using a higher-level API, like Keras model.fit, these lessons are still broadly applicable; it's just that some of this will be done automatically for you. So if we want to show why these APIs are important, it's useful to really dig into the internals of the training loop.

So because we're training on MobileNet, we've chosen to use the Keras Applications MobileNet V2, which is just a canned MobileNet V2 implementation. We're training a classifier, so we're going to be using sigmoid cross entropy with logits. This is a numerically stable cross-entropy function. And then, finally, we're going to be using the Keras SGD optimizer. So if we were to call our model on our features (in this case it's an image classification task, so our features are an image), we would get logits, which are the model's class predictions. If we then apply our loss function, comparing the labels and the logits, this gives us a scalar loss, which we can then use to optimize and update our model. We want to call all of this under a tf.GradientTape. What the gradient tape does is it watches the computations as they're performed, maintains the metadata, and keeps the activations alive. This is what allows us to take gradients, which we can then pass to the optimizer to update our model weights. And, finally, if we wrap all of this into a step and then iterate over our data in mini-batches, applying this step to each mini-batch, this comprises the general structure of our training loop.
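To make that structure concrete, here is a hedged sketch of what such a custom training loop might look like. The exact notebook code differs; names like train_ds, NUM_EPOCHS, and the 224x224 input size are assumptions.

```python
import tensorflow as tf

# Canned MobileNet V2 backbone with a single-logit head for the cat classifier (sketch).
model = tf.keras.Sequential([
    tf.keras.applications.MobileNetV2(include_top=False, pooling="avg",
                                      input_shape=(224, 224, 3), weights=None),
    tf.keras.layers.Dense(1),
])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

def train_step(features, labels):
    with tf.GradientTape() as tape:
        logits = model(features, training=True)
        # Numerically stable sigmoid cross entropy, reduced to a scalar loss.
        loss = tf.reduce_mean(
            tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits))
    # The tape has watched the forward pass, so we can take gradients...
    grads = tape.gradient(loss, model.trainable_variables)
    # ...and hand them to the optimizer to update the model weights.
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

# train_ds is assumed to yield (image, label) mini-batches with float labels.
for epoch in range(NUM_EPOCHS):
    for images, labels in train_ds:
        loss = train_step(images, labels)
```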
On the data side, we're going to start out with just a strawman Python generator to show the different parts of the data pipeline, and then we'll look at its performance a little bit later. So what do we need to do? Well, first, we'll want to shuffle the data so that we see a different ordering each epoch. We'll need to load all of the images from disk into memory so that they can be consumed by the training process. We'll also want to do some augmentation. So we'll want to resize the images from their native resolution down to the size that the model expects, and then we'll do some task-specific augmentation. In this case, we'll randomly flip the images and we'll add some pixel-level noise to make our training a little bit more robust. And, finally, we'll need to batch these. The generator that I've shown is producing examples one at a time, but we'll need to collect them into mini-batches, and the key operation for that batching is a concatenation.

It's worth noting that different parts of the data pipeline will stress different parts of the system. Loading from disk is an I/O-bound task, and we'll generally want to consume this I/O as fast as possible so that we're not constantly waiting for images to arrive from disk one at a time. Secondly, the augmentation tends to be rather CPU-intensive, because we're doing various sorts of math in our augmentation. And, finally, batching tends to be a somewhat memory-intensive task, because we're copying examples from their original location into the memory buffer of our mini-batch.

So now that we have a scaffolding, let's just go ahead and run this. We're going to be running on an Nvidia V100 GPU. By default, TensorFlow will attempt to place ops on the optimal device, and because we're doing large matrix multiply operations, it will place them on the GPU. However, we find that our training, at the start, is pretty lackluster. We're under 100 images a second. And if we actually look at our device utilization, it's well under 20%. So what's going on here? Well, in order to determine that, we can actually do some profiling, and TensorBoard in TensorFlow 2 makes this quite easy to do.

You might already be familiar with TensorBoard. This is the standard way of monitoring training as it runs. So you might be, for instance, familiar with the scalars tab, which will show things like losses and metrics as your training continues. Well, in TensorFlow 2 there's a new tab, and that is the profiler tab. What it does is it looks at each op as it runs in TensorFlow and shows you a timeline of how your training is progressing. And we're going to look in a little bit more detail at how to read and interpret these timelines. Before we do, however, let's talk about how to enable it. It's quite straightforward: you simply turn on the profiler, which signals to the runtime that it should keep records of ops as it runs them, write your program as normal, and then simply export the trace, and this will put it in a format that TensorBoard can consume and display for you.

So this is a timeline. Moving left to right we have time, and each row is a different thread. You can see that they're annotated with the device, so the top row is the CPU and the bottom ones are the various GPU threads. In this case, we're looking at three different training steps. Each vertical line is an individual op that's been run by TensorFlow, and you can see that the ops are scheduled from the CPU and then appear on the GPU stream.
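For reference, programmatically enabling and exporting a profile looks roughly like this in recent TF 2.x releases. The notebook from this talk may instead use the older tf.summary.trace_on / trace_export calls, and the log directory name is a placeholder.

```python
import tensorflow as tf

logdir = "logs/profile"                  # placeholder output directory
tf.profiler.experimental.start(logdir)   # tell the runtime to record ops as it runs them

# ... run a handful of training steps as normal ...

tf.profiler.experimental.stop()          # export the trace so TensorBoard can display it
```

Pointing TensorBoard at that log directory then shows the trace under the profiler tab.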
Coming back to the timeline: the GPU actually executes the ops, and this is where the heavy lifting is done. But you can also see that we have these large gaps in between where no ops are running and our system is essentially stalled. And this is because after each step we have to wait for our naive generator to produce the next batch of data, so this is obviously quite inefficient. So Priya's going to tell you how we can improve that.

PRIYA GUPTA: Thanks, Taylor. So as Taylor just showed us, writing your own input pipeline in Python to read data and transform it can be pretty inefficient. TensorFlow provides the tf.data API to allow you to easily build performant and scalable input pipelines. You can think of the tf.data input pipeline as an ETL process. The first stage is the extract stage, where we read the data from, let's say, network storage or from your local disk, and then potentially parse the file format. The second stage is the transform stage, which is where you take the input file data and transform it into a form that's amenable to your ML computation. This could be image transformations specific to your ML task, or generic transformations like shuffling or batching that apply to a lot of ML tasks. Once you transform this data, you can then load it into your accelerator for your training.

So let's take a look at some code to see how you can use tf.data to write the same input pipeline that Taylor showed you before, but in a much more efficient way. There are a number of steps here, and I'll go over them one by one. The first one is we create a simple dataset consisting of all the filenames in our input. So in this case, we have a number of image files, and the path glob specifies how to find these filenames. The second thing we do is shuffle the filenames, because in training typically you want to shuffle the input data as it's coming in. And then we apply a transformation called the map transformation, and we provide this get_bytes_and_label custom function. What this function does is it reads the files one by one using the tf.io.read_file API, and it uses the filename path to compute the label and returns both of these. So these three steps comprise the extract stage that I talked about before: we're reading the files and parsing the file data.

Note that this is just one way in which you can read data using tf.data, and there are a number of different APIs that you can use for other situations, which we have listed here. The most notable one is the TFRecordDataset API, which you would use if your data is in the TFRecord file format, and this is potentially the most performant format to use with tf.data. If you have your input in memory, in a NumPy array let's say, you can also use the from_tensor_slices API to convert it into a tf.data dataset and do subsequent transformations like this.

The next thing we do is another map transformation, to now take this raw image data and do some processing to make it amenable to our training task. So we have this process_image function here which we provide to the map transformation. And if you take a look, it could look something like this: we do some decoding of the image data, and then we apply some image-processing transformations such as resizing, flipping, normalization, and so on.
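Put together, the extract and transform stages just described might look something like this sketch. The glob pattern, the label-from-path rule, and the image size are assumptions standing in for the notebook's details.

```python
import tensorflow as tf

def get_bytes_and_label(filename):
    # Extract: read the raw bytes and derive the label from the file path.
    image_bytes = tf.io.read_file(filename)
    is_cat = tf.strings.regex_full_match(filename, ".*cat.*")  # assumed labeling scheme
    return image_bytes, tf.cast(is_cat, tf.float32)

def process_image(image_bytes, label):
    # Transform: decode, resize, and augment using TensorFlow ops.
    image = tf.io.decode_jpeg(image_bytes, channels=3)
    image = tf.image.resize(image, [224, 224])
    image = tf.image.random_flip_left_right(image)
    image = image + tf.random.normal(tf.shape(image), stddev=2.0)  # pixel-level noise
    return image, label

dataset = (
    tf.data.Dataset.list_files("train_images/*.jpg")  # assumed path glob
    .shuffle(buffer_size=1000)
    .map(get_bytes_and_label)
    .map(process_image)
    .batch(32)
)
```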
The key things to note here are that these transformations are actually very similar, and correspond one to one, to what you saw before in the Python version, but the big difference is that now you're using TensorFlow ops to do those transformations. The second thing is that tf.data will take the custom function that you specify to the map transformation, trace it as a TensorFlow graph, and run it in the C++ runtime instead of Python. And this is what allows it to be much more efficient than Python. And finally, we use the batch transformation to batch the elements of your input into mini-batches, which is a very common practice for training efficiency in ML tasks.

So before I hand it off to Taylor to talk about how using tf.data can improve the performance in our MobileNet example, I want to walk through a few more advanced performance optimization tips for tf.data. I'll go through them quickly, but you can also read about them in much more detail on the page that's listed here.

The first optimization that I want to talk about is pipelining, and this is conceptually very similar to any other software pipelining that you might be aware of. The high-level idea is that while you're training a single batch of data on your accelerator, we want to use the CPU resources at the same time to process and prepare the next batch of data. What this means is that when the next training step starts, we don't have to wait for the next batch of data to be prepared; it will automatically be there, and this can reduce the overall training time significantly. Using software pipelining in tf.data is very simple: you can use the prefetch transformation as shown here.

The second optimization that I want to talk about is parallelizing the transformation stage. By default, the map transformation will apply the custom function that you provide to each element of your input dataset in sequence. But if there is no dependency between these elements, there's no reason to do this in sequence, right? So you can parallelize this by passing the num_parallel_calls argument to the map transformation, and this indicates to the tf.data runtime that it should run these map operations on the elements of the dataset in parallel.

And the third optimization that I want to talk about is parallelizing the extraction stage. Similar to how transforming your elements in sequence can be slow, reading files one by one can be slow as well, so of course you want to parallelize it. And in this example, since we are using a map transformation to read our files, the way you do it is actually very similar to what we just saw: you add a num_parallel_calls argument to the map transformation that you use to read your files. Note that if you're using one of the built-in file readers, such as the TFRecordDataset, you can provide a very similar argument to that in order to parallelize the file reading there.

So if you've been paying close attention, you'll notice that we have these magic numbers X, Y, and Z on the slides, and you might be wondering how you determine the optimal values for them. In reality, it's actually not very straightforward to compute the optimal values of these parameters, because if you set them too low, you might not be using enough parallelism in your system, and if you set them too high, it might lead to contention and actually have the opposite effect of what you want. Fortunately, tf.data makes it really easy for you to specify these.
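Applied to the pipeline above (reusing the helper functions from the earlier sketch), those three optimizations might look like this, with explicit parallelism values standing in as placeholders for the X, Y, and Z on the slides:

```python
dataset = (
    tf.data.Dataset.list_files("train_images/*.jpg")
    .shuffle(buffer_size=1000)
    .map(get_bytes_and_label, num_parallel_calls=16)  # parallel extract (placeholder value)
    .map(process_image, num_parallel_calls=16)        # parallel transform (placeholder value)
    .batch(32)
    .prefetch(2)                                      # overlap input prep with training (placeholder value)
)
```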
Instead of specifying specific values for these X, Y, and Z arguments, you can simply use the constant tf.data.experimental.AUTOTUNE. What this does is indicate to the tf.data runtime that it should do the autotuning for you and determine the optimal values for these arguments based on your workload, your environment, your setup, and so on. So that's all for tf.data. I'm now going to hand it off to Taylor to talk about what kind of performance benefits you can see.

TAYLOR ROBIE: Thanks, Priya. So on the right here we can see the timeline before and after we add tf.data. Before, we have this long stall in between our training steps where we're waiting for data. Once we've used tf.data, you'll note two things about the timeline. The first is that the training step begins immediately after the previous one, and that's because tf.data was actually preparing the upcoming batch while the previous training step was happening. The second is that before, it actually took longer than the time of a batch to prepare the next batch, whereas now we see no large stalls in the timeline. The reason is that tf.data, because of its native parallelism, can now produce batches much more quickly than the training can consume them. And you can see this manifest in our throughput, with more than a 2x improvement.

But sometimes there are even more stalls. If we zoom in on the timeline of one of our training steps, we'll see that there are a number of these very small gaps, and these are launch overheads. If we look at different portions of the GPU stream, near the end, this is what a healthy profile looks like: one op runs and finishes, and immediately the next op can begin, and this results in an efficient and saturated accelerator. On the other hand, in some of the earlier parts of the model, an op is scheduled on the GPU, the GPU immediately chews through it, and then it simply waits idle for Python to enqueue the next op. And this leads to very poor accelerator utilization and efficiency.

So what can we do? Well, if we go back to our training step, we can simply add the tf.function decorator. What this will do is trace the entire training step into a tf.Graph, which can then be run very efficiently by the TensorFlow runtime. And this is the only change needed. Now if we look at our timeline, we see that all of the scheduling is done in this single, much faster op on the CPU, and then the work appears on the GPU as before. This also allows us to use the rich library of graph optimizations that were developed in TensorFlow 1. And you can see this is again almost another factor-of-two improvement, and it's pretty obvious in the timeline on the right why. Whereas before tf.function, when we were running everything eagerly, we had all of these little gaps waiting for ops, now, once the step is compiled into a graph and launched from the C++ runtime, it's able to do a much better job keeping up with the GPU.

The next optimization that I'm going to talk about is XLA, which stands for Accelerated Linear Algebra. To understand how XLA works, we have just this simple example of a graph with some skip connections. What XLA will do is cluster that graph into subgraphs and then compile each entire subgraph into a very efficient fused kernel. And there are a number of performance gains from using these fused kernels.
The first is that you get much more efficient memory access, because the kernel can use data while it's hot in the cache, as opposed to pushing it all the way down the memory hierarchy and then bringing it all the way back. It also reduces launch overhead (this is the C++ executor launch overhead) by running fewer ops for the same math. And finally, XLA is heavily optimized to target hardware. So it does things like use efficient hardware-specific vector instructions, specialize on shapes, and choose layouts so that the hardware is able to have very efficient access patterns.

And it's quite straightforward to enable. You simply set the tf.config.optimizer.set_jit flag, and this will cause every tf.function that runs to attempt to compile and run with XLA. You can see that, in our example, it's a very stark improvement: about a 2.5x improvement in throughput. The one caveat with XLA is that a lot of the optimizations that it uses are based on specializing on shapes, so XLA needs to recompile every time it sees new shapes. If you have an extremely dynamic model (for instance, the shapes are different each batch), you might actually wind up spending more time on the XLA compile than you gain back from the more efficient computation. So you should definitely try out XLA, but if you see a performance regression rather than a performance gain, it's likely that that's the reason.

So next we're going to talk about mixed precision. If we want to go even faster, then we can give up some of our numeric stability in order to obtain faster training performance. Here I have the IEEE float32, which is the default numeric representation in TensorFlow, but there are also two half-precision formats that are relevant. The first is bfloat16, where we keep all of the exponent bits and simply chop off 16 bits of mantissa. And the other is float16, where we give up some exponent in exchange for keeping a little bit more mantissa. What's important about these formats is that they actually have native hardware support: TPUs have hardware support for very efficient bfloat16 operations, and GPUs have support for very efficient float16 operations. So if we can formulate our computation in these reduced-precision formats, we can potentially get very high speedups from the hardware.

In order to enable this, we need to do a couple of things. First, we need to choose a loss scale. So what is a loss scale? A loss scale is a constant that's inserted into the computation; it doesn't change the mathematics of the computation, but it does change the numerics. So this is a knob by which the runtime can adjust the computation to keep it in a numerically stable range. The next thing we'll want to do is set a mixed-precision policy, and this will tell Keras that it should cast tensors as they flow through the model in order to make sure that the computation is actually happening in the correct floating-point representation. In a custom training loop, we'll want to wrap our optimizer in a loss-scale optimizer, and this is the hook by which the loss scale is inserted. As training happens, TensorFlow will do a dynamic adjustment of this loss scale to balance vanishing gradients from FP16 underflow while still preventing NaNs from FP16 overflow. If you're using the Model.fit workflow, this will be done for you. And finally, we generally need to increase our batch size when doing this mixed-precision training.
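A hedged sketch of wiring these in, using the experimental mixed-precision API names from around the TF 2.0/2.1 era (later releases move these out of experimental and rename some of them); the batch size is a placeholder:

```python
# XLA: every tf.function will now attempt to compile and run its graph with XLA.
tf.config.optimizer.set_jit(True)

# Mixed precision: compute in float16, keep variables in float32, and use a
# dynamic loss scale to balance float16 underflow against overflow.
policy = tf.keras.mixed_precision.experimental.Policy("mixed_float16")
tf.keras.mixed_precision.experimental.set_policy(policy)

optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
optimizer = tf.keras.mixed_precision.experimental.LossScaleOptimizer(
    optimizer, loss_scale="dynamic")

# In a custom loop, the loss is scaled via optimizer.get_scaled_loss(...) and the
# gradients unscaled via optimizer.get_unscaled_gradients(...) before apply_gradients.
BATCH_SIZE = 256  # roughly double the float32 batch size to keep the GPU saturated (placeholder)
```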
The reason we need a bigger batch size is that mixed precision makes computing a single example much less expensive. So if we just turn on mixed precision, we can go from a saturated accelerator in float32 to an underutilized accelerator in float16, and by increasing the batch size, we can go back to filling all of the hardware registers. And we can see this in our example. If we just turn on float16, there's actually no improvement in performance, but if we then increase our batch size, we see a very substantial improvement in performance. It is worth noting that because we've both reduced the numeric precision and changed the batch size, this can require that you retune the hyperparameters.

And then finally, if we look at the remaining bits of performance, about 60% of what's left is actually the copy from the host CPU to the GPU. In a little bit Priya's going to talk about distribution, and one of the things that the distribution-aware code in TensorFlow will do is automatically pipeline this prefetch, so you'll actually get that 60% for free. And then finally you have hand tuning. You can give up some of the numeric stability, you can mess with thread pools, you can manually try to optimize layouts. This is included largely for completeness, in case you're curious what real low-level hand tuning looks like. But for most cases, given that we can get the vast majority of the performance with just simple, idiomatic, easy-to-use APIs, this very fine hand tuning is not recommended. But once we've actually saturated a single device, we need to do something else to get another order-of-magnitude improvement. And so Priya's going to talk to you about that.

PRIYA GUPTA: Thanks, Taylor. All right, so Taylor talked about a number of different things you can do to get the maximum performance out of a single machine with a single GPU. But if you want your training to go even faster, you need to start scaling out. So maybe you want to add more GPUs to your machine, or perhaps you want to train on multiple machines in a cluster, or maybe you want to use specialized hardware such as Cloud TPUs. TensorFlow provides the distribution strategy API to allow you to do just that. We built this API with three key goals in mind. The first goal is ease of use: we want users to be able to use the tf.distribute.Strategy API with very few changes to their code. The second goal is to give great performance out of the box: we don't want users to have to change their training code or tune a lot of knobs to get the maximum efficiency out of their hardware resources. And finally, we want this API to work in a variety of different situations. So whether you want to scale out to multiple GPUs or multiple machines or TPUs, or you want different distributed-training architectures such as synchronous or asynchronous training, or you're using different types of APIs (maybe the high-level Keras Model.fit, or a custom training loop as in our example earlier), we want distribution strategy to work in all of these potential cases.

There are a number of different ways in which you can use this API, and we've listed them here in order of increasing complexity. The first one is if you're using the Keras high-level API model.fit for your training loop and you want to distribute your training in that setup. The second use case is what we've been talking about in this talk so far, where you have a custom training loop and you want to scale out your training.
The third use case is maybe you don't have a specific training program, but you're writing a custom layer or a custom library and you want to make it distribution-aware. And finally, maybe you're experimenting with a new distributed-training architecture and you want to create a new strategy. In this talk, I'm only going to cover the first two cases.

So let's begin. The first use case we'll talk about is if you have a single machine with multiple GPUs and you want to scale up your training to this setup. For this, we provide MirroredStrategy. MirroredStrategy implements synchronous training across multiple GPUs on a single machine. The entire computation of your model is replicated on each GPU. All the variables of your model are replicated on each GPU as well, and they are kept in sync using all-reduce.

Let me go step by step through what this synchronous training looks like. There are some gray boxes around these on the slide which are not really visible, but let's say in our example we have two devices, or two GPUs, Device 0 and Device 1, and we have a very simple model with two layers, Layer A and B. Each layer has a single variable. As you can see, the variables are mirrored, or replicated, on these two devices. In the forward pass, we'll give a different slice of our input data to each of these devices, and they will compute the logits using the local copy of the variables on each device. In the backward pass, each device will then compute the gradients, again using the local copy. Once each device has computed the gradients, they'll communicate with each other to aggregate them, and this is where the all-reduce that I mentioned before comes in. All-reduce is a special class of algorithms that can be used to efficiently aggregate tensors, such as gradients, across different devices, and it can reduce the overhead of such synchronization by quite a bit. There are a number of different such algorithms available, and some hardware vendors, such as Nvidia, also provide specialized implementations of all-reduce for their hardware, such as NCCL. Once these gradients have been aggregated, the aggregated result is available on each device, and each device can update its local copy of the variables using these aggregated tensors. So in this way, both devices are kept in sync, and the next forward pass doesn't begin until all the variables have been updated.

So now let's take a look at some code to see how you can use MirroredStrategy to scale up your training. As I mentioned, we'll talk about two types of use cases. The first one is if you're using the Keras high-level API, and then we'll come back to the custom-training-loop example after this. The code here is some skeleton code to train the same MobileNet V2 model, but this time using the model.compile and fit API in Keras. In order to change this code to use MirroredStrategy, all you need to do is add two lines of code. The first thing you do is create a MirroredStrategy object, and the second thing is you move the rest of your training code inside the scope of this strategy. Putting things inside the scope lets the strategy take control of things like variable creation, so any variables that you create under the scope of the strategy will now be mirrored variables. You don't need to make any other changes to your code in this case, because we've already made components of TensorFlow distribution-aware.
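In code, those two additions look roughly like this sketch; the model head and compile arguments are assumptions in place of whatever the notebook uses.

```python
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # Variables created under the scope become mirrored variables on each GPU.
    model = tf.keras.Sequential([
        tf.keras.applications.MobileNetV2(include_top=False, pooling="avg",
                                          input_shape=(224, 224, 3), weights=None),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer=tf.keras.optimizers.SGD(0.01),
                  loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))

model.fit(train_ds, epochs=10)  # train_ds assumed: the tf.data pipeline from earlier
```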
So, for instance, in this case we have the optimizer, which is distribution-aware, as well as compile and fit. The case we just saw was the simplest way in which you can create a MirroredStrategy, but you can also customize it. By default it will use all the GPUs available on your machine for training, but if you want to use only specific ones, you can specify them using the devices argument. You can also customize which all-reduce algorithm you want to use with the cross_device_ops argument.

So now we've seen how to use MirroredStrategy with the high-level Keras model.fit API. Let's go back to our custom-training-loop example from before and see how you can modify that to use MirroredStrategy. There's a little bit more code in this example, because when you have a custom training loop, you have more control over what exactly you want to distribute, and so you need to do a little bit more work to distribute it as well. As a reminder, this is the skeleton of the custom training loop from before: we have the model, the loss function, the optimizer, and the training step, and then you have your outer loop which iterates over your data and calls the training step.

The first thing you need to do is the same as before: you create the MirroredStrategy object, and you move the creation of the model, the optimizer, and so on inside the scope of this strategy. The purpose of this is the same as before. But as I mentioned, you need to do a few more things, so this is not sufficient in this case. Let's go over them one by one. The first thing you typically need to do is distribute your data, and if you're using tf.data datasets as your input, this is straightforward. All you need to do is call strategy.experimental_distribute_dataset on your dataset, and this returns a distributed dataset which you can then iterate over in a very similar manner as before. The second thing you need to do is scale your loss by the global batch size, and this is very important so that the convergence characteristics of your model do not change. We've provided a helper method in the tf.nn library to do so. And the third thing you need to do is specify which computation you want to replicate. In this case, we want to replicate our training step on each replica, so you use the strategy.experimental_run_v2 API, providing the training-step function as well as your distributed input, and you wrap this whole thing in another tf.function, because we want all of this to run as a tf.Graph as well. Those are all the code changes you need to make in order to take the custom training loop from before and run it in a distributed fashion using the distribution strategy API.

So let's take a look at what kind of scaling you can see. Just out of the box, adding this MirroredStrategy to the MobileNet V2 example from before, we're able to get 80% scaling when going from one GPU to eight GPUs on a single machine. It's possible to get even more scaling out of this by doing some manual optimizations, which we won't be going into today. To give another example of the scaling, we ran multi-GPU training for ResNet50, which is a very popular image-classification benchmark. In this example, we also used the FP16 and XLA techniques that Taylor talked about before, and you can see that going from one to eight GPUs we were able to get 85% scaling. This example was using the Keras model.fit API instead of the custom-training-loop example.
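To make the custom-training-loop changes just described concrete, here is a hedged sketch using the API names from the time of this talk (strategy.experimental_run_v2 is renamed to strategy.run in later releases); the per-replica batch size and model head are assumptions.

```python
strategy = tf.distribute.MirroredStrategy()
GLOBAL_BATCH_SIZE = 32 * strategy.num_replicas_in_sync  # 32 per replica (placeholder)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.applications.MobileNetV2(include_top=False, pooling="avg",
                                          input_shape=(224, 224, 3), weights=None),
        tf.keras.layers.Dense(1),
    ])
    optimizer = tf.keras.optimizers.SGD(0.01)

# train_ds assumed: a tf.data dataset batched with GLOBAL_BATCH_SIZE.
dist_dataset = strategy.experimental_distribute_dataset(train_ds)

def train_step(images, labels):
    with tf.GradientTape() as tape:
        logits = model(images, training=True)
        per_example_loss = tf.nn.sigmoid_cross_entropy_with_logits(
            labels=labels, logits=logits)
        # Scale by the global batch size so convergence characteristics don't change.
        loss = tf.nn.compute_average_loss(
            per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

@tf.function
def distributed_train_step(images, labels):
    # Replicate the step on each replica and sum the per-replica losses.
    per_replica_losses = strategy.experimental_run_v2(train_step, args=(images, labels))
    return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None)

for images, labels in dist_dataset:
    loss = distributed_train_step(images, labels)
```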
If you're interested in that ResNet50 example, you can look at the link at the bottom of the slide and try out this model yourself. Another example uses the Transformer language model, to show that this is not just for images; we're able to scale up other types of models as well. In this case, we're able to get more than 90% scaling when going from one to eight GPUs.

So far we've been talking about scaling up to multiple GPUs on a single machine, but most likely you would want to scale even further to multiple machines, maybe with multiple GPUs each, or perhaps just with CPUs. For these use cases, we provide MultiWorkerMirroredStrategy. As the name suggests, this is very similar to the MirroredStrategy that we've been talking about. It also implements synchronous training, but this time across all the different machines in your cluster. In order to do the all-reduce, it uses a new type of op in TensorFlow called collective ops. A collective op is a single op in the TensorFlow graph which can determine the best all-reduce algorithm to use based on a number of different factors, such as the network topology, the type of communication available between the different machines, and the tensor sizes. It can also implement optimizations such as tensor fusion: for instance, if you have a lot of small tensors that you want to aggregate, it may batch them up into a single tensor in order to reduce the load on the network.

So how can you use MultiWorkerMirroredStrategy? It's actually very similar to MirroredStrategy. You create a strategy object like so, and the rest of your training code actually remains the same, so I've omitted it here. Once you have the strategy object, you put the rest of your code in strategy.scope, and you're good to go. One more thing you need to do in the multi-worker case, however, is to give us information about your cluster, and one way to do that for multi-worker strategy is using the TF_CONFIG environment variable. You might be familiar with this environment variable if you've used distributed training with Estimator in TensorFlow 1. It basically consists of two components. The first is the cluster, which gives us information about your entire cluster; here we have three different workers at these hosts and ports. And the second piece is information about the specific worker; this is saying this is worker one. Once you provide this TF_CONFIG, the task part will be different on each worker, but the training code can remain the same. Distribution strategy will read this TF_CONFIG environment variable and figure out how to communicate with the different workers in your cluster.

So we've been talking about multiple machines, multiple GPUs. What about TPUs? You've probably heard about TPUs somewhere else in this conference before. TPUs are custom hardware built by Google to accelerate machine-learning workloads. You can use them through Cloud TPUs, or you can even try them out in Colab. Distribution strategy provides TPUStrategy for you to be able to scale up your training to TPUs. It's very similar to MirroredStrategy in that it implements synchronous training, and it uses the cross_replica_sum API to do the all-reduce across the TPU cores. You can use this API to scale your training to a single TPU, or a slice of a pod, or a full pod as well. If you want to try out TPUStrategy, you can try it with TensorFlow nightlies, or you can wait for the 2.1 stable release as well. Let's look at the code to see how you can use TPUStrategy.
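A sketch of that setup, using the experimental API names referenced in this talk; the TPU name is a placeholder.

```python
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="my-tpu")  # placeholder name/address
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)

strategy = tf.distribute.experimental.TPUStrategy(resolver)

with strategy.scope():
    # Model, optimizer, and the rest of the training code go here, exactly as before.
    ...
```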
The first thing you need to do is to provide information about the TPU cluster. The way you do that is create a TPUClusterResolver and give it the name or address of your TPU. Once you have a ClusterResolver, you can use the experimental_connect_to_cluster API to connect to this cluster, and also initialize the TPU system. Once you've done these three things, you can then create the strategy object and pass the ClusterResolver object to it. Once you have the strategy object, the rest of the code remains the same: you create the strategy scope, and you put your training code inside that, so I won't be going into that here. Looking at this code, you can see that distribution strategy makes it really easy to switch from training on multiple GPUs or multiple machines to specialized hardware like TPUs.

So that's all I was going to say about distribution strategy. To recap, today we talked about a few different things. We talked about tf.data to build simple and performant input pipelines. We talked about how you can use tf.function to run your training as a TensorFlow graph. We talked about how you can use XLA and mixed precision to improve your training speed even further. And finally, we talked about the distribution strategy API to scale out your training to more hardware. We have a number of resources on our website, a number of guides and tutorials, so we encourage you to check those out. And if you have any questions or you want to report bugs, please reach out to us on GitHub. We'll also be available in the expo hall after the talk if you have more questions. And finally, the notebook that we've been using in the talk is listed here, so if you want to take a picture, you can check it out later. That's all. Thank you so much for listening.

[APPLAUSE]

I guess we have some time, so we can take some questions.

AUDIENCE: Hi. Thanks for the nice presentation. I have two questions, both regarding inference. You mentioned how we can use TensorBoard for profiling to see how well our CPUs and GPUs are utilized. Can we do the same to see how well our inference is working?

TAYLOR ROBIE: Yes. The profiler has no concept of training or inference; it just looks at ops as they execute. So if you're just running the forward pass, they'll show up all the same.

PRIYA GUPTA: It also kind of depends on how you're doing inference. Are you doing it using Python APIs, or are you--

AUDIENCE: SavedModel. So let's say I have a SavedModel format, and I want to use that in TensorFlow.

PRIYA GUPTA: So you're not really using any of the Python APIs, right?

AUDIENCE: So I use Python APIs, but basically to--

PRIYA GUPTA: So you're loading the SavedModel in Python, and then-- yeah, then you can use TensorBoard as well.

AUDIENCE: Regarding mixed precision, will it help in inference as well, and how do we enable that?

TAYLOR ROBIE: It can, potentially. Typically for inference, the integer low-precision formats tend to be more important than the floating-point ones. If you really care about very fast inference with low precision, I would suggest that you check out TF Lite, because they have a lot of quantization support.

AUDIENCE: Thank you.

AUDIENCE: Hello. I had just one question about how to save your model when you are training with a distribution strategy, because your code is replicated on all the nodes. What's the official way to do this?

PRIYA GUPTA: Yes, so you want to save a full SavedModel?

AUDIENCE: Yes.

PRIYA GUPTA: Yeah, so you can use the standard APIs, just tf.saved_model.save.
And what it will do is it will not save the replicated model; it'll save a single copy of the model. So then it's like any other SavedModel, and you can reload it. If you load it inside another strategy scope, it will get replicated at that point, or you can use it for inference like any other SavedModel.

AUDIENCE: Because what I experienced was that all the nodes tried to save the model at the same time, and then there was an issue, a race condition with the placement of where the models were saved.

PRIYA GUPTA: And so you got an error?

AUDIENCE: Yes, it crashed.

PRIYA GUPTA: Maybe we can talk offline. It's supposed to work, but there could be bugs.

AUDIENCE: Thanks for the talk. I have a quick question about mixed precision and XLA. Will the API automatically use the hardware-related features? Like, for example, with mixed precision, if you have a Volta machine, will it be able to use the Volta machine's [INAUDIBLE]?

TAYLOR ROBIE: Yes. For instance, on Volta, it will automatically try to use the Tensor Cores. And in fact XLA and mixed precision synergize very well, because mixed precision inserts a bunch of casts, and then XLA will go along and actually optimize a lot of those away. It will try to make the layout more amenable to mixed precision. So XLA tends to talk to the hardware at a very low level.

AUDIENCE: So basically, from the user's point of view, it's transparent? If I run the same code on a CPU, an old GPU, a new GPU, it will automatically try to use the hardware's capabilities?

TAYLOR ROBIE: Yes.

SPEAKER: I'm really sorry. We kind of have to stop this now for the livestreaming.

PRIYA GUPTA: We can take more questions outside the hall.