
  • TAYLOR ROBIE: I'm Taylor.

  • I'm an engineer on the TF Keras team.

  • PRIYA GUPTA: And I'm Priya.

  • I'm also an engineer on the TensorFlow team.

  • I lead the distribution strategy team.

  • TAYLOR ROBIE: And today we're going

  • to be talking about writing performant models

  • in TensorFlow 2.

  • Just to give a brief outline of the talk,

  • we're, first, going to talk about the structure

  • of a training loop in TensorFlow 2.

  • And then we're going to talk about the different APIs

  • that you can use to make your training more performant.

  • The first API that we'll be talking about is tf.data.

  • This is the API for writing high performance pipelines

  • to avoid various sorts of stalls and make sure

  • that your training always has data

  • as it's ready to consume it.

  • Next we're going to be talking about tf.function.

  • This is how you can formulate your training step

  • in a way that is efficient so that the runtime can

  • run with very minimal overhead.

  • And, finally, we're going to be talking about tf.distribute

  • which is how to target your training to specific hardware

  • and also scale out to more hardware.

  • In order to demonstrate all of these APIs,

  • we're going to be walking through a case study

  • starting with the most naive implementation

  • and then progressively adding more performant APIs

  • and looking at how that helps our training.

  • So we're going to be training a cat classifier.

  • Because we're an internet company, that's what we do.

  • And we're going to be training it on MobileNet V2.

  • This task was largely chosen arbitrarily

  • just as a representative task, so these lessons

  • should be broadly applicable to your workflows as well.

  • And there's a notebook.

  • And I'll put the link back at the end of the slide

  • so you're encouraged to take a look afterward and compare it

  • back to the talk.

  • So let's get started.

  • A brief overview of training loops in TensorFlow 2.

  • We're going to be writing a custom training loop.

  • And the reason is that we want to look

  • at the mechanics of what the system is actually doing

  • and how we can make it performant.

  • If you're used to using a higher level API,

  • like Keras Model.fit, these lessons are still

  • broadly applicable; it's just that some of this

  • will be done automatically for you.

  • So if we want to show why these APIs are important,

  • it's useful to really dig into the internals of the training

  • loop.

  • So because we're training on MobileNet

  • we've chosen to use the Keras Applications MobileNet

  • V2 which is just a canned MobileNet V2 implementation.

  • We're training a classifier.

  • So we're going to be using sigmoid cross entropy

  • with logits.

  • This is a numerically stable cross entropy function.

  • And then, finally, we're going to be using the Keras SGD

  • optimizer.

  • So if we were to call our model on our features,

  • in this case it's an image classification

  • task so our features are an image,

  • we would get logits, which are the predicted class scores.

  • If we then apply our loss function

  • comparing the labels and the logits,

  • this gives us a scalar loss which

  • we could then use to optimize and update our model.

  • We want to call all of this under a tf.GradientTape.

  • What the gradient tape does is it watches the computations

  • as they're performed, and maintains the metadata,

  • and keeps the activations alive.

  • This is what allows us to take gradients

  • which we can then pass to the optimizer

  • to update our model weights.

  • And, finally, if we wrap all of this into a step

  • and then iterate over our data in many batches,

  • applying this step in each mini batch,

  • this comprises the general structure of our training loop.
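
A minimal sketch of that training loop, assuming a `train_ds` iterable of (image, label) mini batches and a single-logit binary-classification head; names and hyperparameters are illustrative rather than the notebook's exact code:

    import tensorflow as tf

    # MobileNet V2 backbone with a one-logit head for the cat classifier.
    backbone = tf.keras.applications.MobileNetV2(
        weights=None, include_top=False, pooling="avg")
    model = tf.keras.Sequential([backbone, tf.keras.layers.Dense(1)])

    optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

    def train_step(features, labels):
        # labels: float32 tensor of shape [batch, 1]
        with tf.GradientTape() as tape:
            logits = model(features, training=True)            # forward pass
            per_example_loss = tf.nn.sigmoid_cross_entropy_with_logits(
                labels=labels, logits=logits)                  # numerically stable
            loss = tf.reduce_mean(per_example_loss)            # scalar loss
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss

    for features, labels in train_ds:   # iterate over mini batches
        train_step(features, labels)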

  • On the data side, we're going to start out with just a straw man

  • Python generator to show the different parts of the data

  • pipeline and then we'll look at its performance a little bit

  • later.

  • So what do we need to do?

  • Well, first, we'll want to shuffle

  • the data so that we see a different ordering each epoch.

  • We'll need to load all of the images from disk into memory

  • so that they can be consumed by the training process.

  • We'll also want to do some augmentation.

  • So we'll want to resize the images from their native format

  • down to the format that the model expects.

  • And then we'll do some task-specific augmentation.

  • So, in this case, we'll randomly flip the images

  • and we'll add some pixel level noise

  • to make our training a little bit more robust.

  • And, finally, we'll need to batch these.

  • So the generator that I've shown is

  • producing examples one at a time,

  • but we'll need to collect them into mini batches.

  • And the key operation for that batching operation

  • is a concatenation.
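
A straw-man version of such a generator might look like the sketch below; it is purely illustrative, assumes Pillow and NumPy for image loading, and does not match the notebook line for line:

    import random
    import numpy as np
    from PIL import Image  # assumption: Pillow is used to load images

    def naive_generator(image_paths, labels, batch_size=32, size=(160, 160)):
        order = list(range(len(image_paths)))
        random.shuffle(order)                                   # shuffle each epoch
        images, targets = [], []
        for i in order:
            img = Image.open(image_paths[i]).resize(size)       # load + resize
            img = np.asarray(img, dtype=np.float32) / 255.0
            if random.random() < 0.5:                           # random horizontal flip
                img = img[:, ::-1, :]
            img = img + np.random.uniform(-0.05, 0.05, img.shape)  # pixel-level noise
            images.append(img)
            targets.append(labels[i])
            if len(images) == batch_size:
                # Batching is essentially a concatenation into one array.
                yield np.stack(images), np.asarray(targets, dtype=np.float32)
                images, targets = [], []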

  • It's worth noting that different parts of the data pipeline

  • will stress different parts of the system.

  • So for loading from disk-- this is an I/O-bound task

  • and we'll generally want to consume this I/O as fast

  • as possible so that we're not constantly waiting for images

  • to arrive from disk one at a time.

  • Secondly, the augmentations tend

  • to be rather CPU intensive because we're

  • doing various sorts of math in our augmentation.

  • And, finally, batching tends to be a somewhat memory-

  • intensive task because we're copying examples

  • from their original location into the memory

  • buffer of our mini batch.

  • So now that we have a scaffolding,

  • let's just go ahead and run this.

  • We're going to be running on an Nvidia V100 GPU.

  • By default, TensorFlow will attempt to place ops

  • on the optimal device.

  • And because we're doing large matrix multiply operations,

  • it will place them on the GPU.

  • However, we find that our training, at the start,

  • is pretty lackluster.

  • We're under 100 images a second.

  • And if we actually look at our device utilization,

  • it's well under 20%.

  • So what's going on here?

  • Well, in order to determine that,

  • we can actually do some profiling.

  • And in TensorFlow 2, TensorBoard makes this quite easy to do.

  • You might already be familiar with TensorBoard.

  • This is the standard way of monitoring training as it runs.

  • So you might be, for instance, familiar

  • with the scalars tab which will show things

  • like losses and metrics as your training continues.

  • Well, in TensorFlow 2 there's a new tab

  • and that is the profiler tab.

  • And what this does is it looks at each op

  • as it runs in TensorFlow and shows you a timeline of how

  • your training is progressing.

  • And we're going to look in a little bit more detail

  • about how to read and interpret these timelines.

  • Before we do, however, let's talk about how to enable it.

  • So it's quite straightforward.

  • You simply turn on the profiler.

  • This will signal to the runtime that it should keep

  • records of ops as it runs them.

  • Write your program as normal, and then simply

  • export this trace, and this will put it in a format

  • that TensorBoard can consume and display for you.
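
For example, a sketch of that enable/export flow; the exact profiler entry points have moved between TF 2.x releases, and `tf.profiler.experimental` is the current spelling:

    import tensorflow as tf

    tf.profiler.experimental.start("logs/profile")   # start recording ops

    # ... run your training steps as normal ...

    tf.profiler.experimental.stop()                  # export the trace for TensorBoard

Pointing TensorBoard at the log directory (for example, tensorboard --logdir logs) then exposes the profiler tab.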

  • So this is a timeline.

  • Moving left to right, we have time

  • and then each row is a different thread.

  • And you can see that they're annotated with device.

  • So the top row is the CPU and then

  • the bottom ones are the various GPU threads.

  • And, in this case, we're looking at three different training

  • steps.

  • Each vertical line is an individual op

  • that's been run by TensorFlow.

  • So you can see that the ops are scheduled from the CPU

  • and then they appear on the GPU stream.

  • And the GPU actually executes the ops and this is where

  • the heavy lifting is done.

  • But you can also see that we have

  • these large gaps in between where no ops are running

  • and our system is essentially stalled.

  • And this is because we have--

  • after each step we have to wait for our naive generator

  • to produce the next batch of data,

  • so this is obviously quite inefficient.

  • So Priya's going to tell you how we can improve that.

  • PRIYA GUPTA: Thanks, Taylor.

  • So as Taylor just showed us, writing your own input pipeline

  • in Python to read data and transform

  • it can be pretty inefficient.

  • TensorFlow provides the tf.data API

  • to allow you to easily build performant and scalable input

  • pipelines.

  • You can think of the tf.data input pipeline

  • as an ETL process.

  • So the first stage is the extract stage

  • where we read the data from, let's say, network storage

  • or from your local disk, and then you potentially

  • are parsing the file format.

  • The second stage is the transform stage,

  • which is where you take the input file data and transform

  • it into a form that's amenable to your ML computation.

  • So this could be image transformations

  • specific to your ML task, or there

  • could be generic transformations like shuffling or batching

  • that apply to a lot of ML tasks.

  • Once you transform this data, you

  • can then load it into your accelerator for your training.

  • So let's take a look at some code

  • to see how you can use tf.data to write the same input

  • pipeline that Taylor showed you before but in a much more

  • efficient way.

  • So there's a number of steps here,

  • and I'll go over them one by one.

  • The first one is we create a simple data

  • set consisting of all the filenames in our input.

  • So in this case, we have a number of image files,

  • and the path glob specifies how to find these filenames.

  • The second thing we do is we want

  • to shuffle the filenames because in training typically

  • you want to shuffle the input data as it's coming in.

  • And we apply a transformation called the map transformation.

  • And we provided this get_bytes_and_label custom

  • function.

  • And what this function does is that it's going to read

  • the file one by one using the tf.io.read_file API.

  • And it uses the filename path to compute the label

  • and returns both of these.

  • So these three steps here comprise the extract stage

  • that I talked about before.

  • We're reading the files and parsing the file data.
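
A sketch of that extract stage; the glob pattern and the label-from-path rule are hypothetical stand-ins for the notebook's:

    PATH_GLOB = "data/train/*/*.jpg"   # hypothetical layout: class name is the parent dir

    def get_bytes_and_label(file_path):
        raw_bytes = tf.io.read_file(file_path)                  # read one file
        label = tf.strings.split(file_path, "/")[-2] == "cat"   # label from the path
        return raw_bytes, tf.cast(label, tf.float32)

    dataset = tf.data.Dataset.list_files(PATH_GLOB)
    dataset = dataset.shuffle(buffer_size=1000)                 # shuffle the filenames
    dataset = dataset.map(get_bytes_and_label)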

  • Note that this is just one way in which you

  • can read data using tf.data, and there

  • are a number of different APIs that you

  • can use for other situations, which we have listed here.

  • The most notable one is the TFRecordDataset API,

  • which you would use if your data is in the TFRecord file format.

  • And this is potentially the most performant format

  • to use with tf.data.

  • If you have your input in memory, let's say in a NumPy array,

  • you can also use the from_tensor_slices API

  • to convert it into a tf.data data set

  • and do subsequent transformations like this.

  • The next thing we do is another map transformation

  • to now take this raw image data but convert it and do

  • some processing to make it amenable for our training task.

  • So we have this process_image function here

  • which we provide to this map transformation.

  • And if you take a look, this could look something like this.

  • We do some decoding of the image data,

  • and then we apply some image-processing transformation

  • such as resizing, flipping, normalization, and so on.

  • The key things to note here are these transformations

  • are actually very similar and correspond one to one to what

  • you saw before in the Python version,

  • but the big difference here is that now instead you're

  • using TensorFlow ops to do those transformations.

  • The second thing is that tf.data will take the custom function

  • that you specify to the map transformation and it will

  • trace it as a TensorFlow graph and run it in the C++ runtime

  • instead of Python.

  • And this is what allows it to make it much more efficient

  • than Python.

  • And finally we use the batch transformation

  • to batch the elements of your input into mini batches,

  • and this is a very common practice for training

  • efficiency in ML tasks.
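
Continuing the sketch from the extract stage, the transform-and-batch stage might look like this; the image size, noise level, and batch size are placeholder values:

    def process_image(raw_bytes, label):
        image = tf.io.decode_jpeg(raw_bytes, channels=3)             # decode
        image = tf.image.convert_image_dtype(image, tf.float32)      # scale to [0, 1]
        image = tf.image.resize(image, [160, 160])                   # resize
        image = tf.image.random_flip_left_right(image)               # random flip
        image = image + tf.random.uniform(tf.shape(image), -0.05, 0.05)  # pixel noise
        return image, label

    dataset = dataset.map(process_image)
    dataset = dataset.batch(32)   # collect examples into mini batches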

  • So before I hand it off to Taylor

  • to talk about how using tf.data can improve

  • the performance in our MobileNet example,

  • I want to walk through a few more advanced performance

  • optimization tips for tf.data.

  • And I'll go through them quickly,

  • but you can also read about them in much more detail

  • on the page that's listed here.

  • The first optimization that I want to talk about

  • is pipelining.

  • And this is conceptually very similar to any other software

  • pipelining that you might be aware of.

  • The high-level idea here is that when

  • you're training a single batch of data on your accelerator,

  • we want to use the CPU resource at the same time

  • to process and prepare the next batch of data.

  • What this will do is that when the next training step starts,

  • we don't have to wait for the next batch of data

  • to be prepared.

  • It will automatically be there, and this

  • can reduce the overall training time significantly.

  • To use software pipelining in tf.data is very simple.

  • You can use the prefetch transformation as shown here.
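
In code, continuing the same sketch (the buffer size is a placeholder; see the AUTOTUNE note below):

    dataset = dataset.prefetch(buffer_size=2)   # prepare batches ahead of the training step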

  • The second optimization that I want to talk about

  • is parallelizing the transformation stage.

  • By default, the map transformation

  • will apply the custom function that you

  • provide to each element of your input data set in sequence.

  • But if there is no dependency between these elements,

  • there's no reason to do this in sequence, right?

  • So you can parallelize this by passing

  • the num_parallel_calls argument to the map transformation.

  • And this indicates to the tf.data runtime

  • that it should run these map operations

  • on your elements of the data set in parallel.
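
For example, continuing the sketch (the parallelism value is a placeholder):

    dataset = dataset.map(process_image, num_parallel_calls=4)   # transform elements in parallel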

  • And the third optimization that I want to talk about

  • is parallelizing the extraction stage.

  • So similar to how transforming your elements in sequence

  • can be slow, similarly reading files one by one

  • can be slow as well.

  • So, of course, you want to parallelize it.

  • And in this example, since we are using a map transformation

  • to read our files, the way you do it

  • is actually very similar to what we just saw.

  • You add a num_parallel_calls argument to the map function

  • that you have to read your files.

  • Note that if you're using one of the built-in file

  • readers such as the TFRecordDataset,

  • you can also provide very similar arguments

  • to that in order to parallelize the file reading there.
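
Two hedged examples of parallel extraction, one for the map-based reader above and one for the built-in TFRecord reader (`tfrecord_files` is a hypothetical list of filenames):

    dataset = dataset.map(get_bytes_and_label, num_parallel_calls=4)       # parallel file reads

    # Or, with a built-in reader:
    records = tf.data.TFRecordDataset(tfrecord_files, num_parallel_reads=4)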

  • So if you've been paying close attention,

  • you'll notice that we have these magic numbers X, Y,

  • and Z on the slides, and you might be wondering,

  • how do you determine the optimal values of these?

  • And in reality, it's actually not

  • very straightforward to compute the optimal values

  • of these parameters because if you set them too low,

  • you might not be using enough parallelism in your system.

  • And if you set them too high, it might lead to contention

  • and actually have the opposite effect of what you want.

  • Fortunately, tf.data makes it really easy

  • for you to specify these.

  • Instead of specifying specific values

  • for these X, Y, Z arguments, you can simply

  • use the constant tf.data.experimental.AUTOTUNE.

  • And what this does is it indicates to the tf.data

  • runtime that it should do the autotuning for you

  • and determine the optimal values for these arguments

  • based on your workload, your environment, your setup, and so

  • on.
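
Putting it together, the whole pipeline with AUTOTUNE might read as below; in newer releases the constant also lives at `tf.data.AUTOTUNE`:

    AUTOTUNE = tf.data.experimental.AUTOTUNE

    dataset = (tf.data.Dataset.list_files(PATH_GLOB)
               .shuffle(buffer_size=1000)
               .map(get_bytes_and_label, num_parallel_calls=AUTOTUNE)
               .map(process_image, num_parallel_calls=AUTOTUNE)
               .batch(32)
               .prefetch(AUTOTUNE))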

  • So that's all for tf.data.

  • I'm now going to hand it off to Taylor

  • to talk about what kind of performance benefits

  • you can see.

  • TAYLOR ROBIE: Thanks, Priya.

  • So on the right here we can see the timeline before and after

  • we add tf.data.

  • So before we have this long stall

  • in between our training steps where we're waiting for data.

  • Once we've used tf.data, you'll note two things

  • about this timeline after.

  • The first is that the training step begins immediately

  • after the previous one, and that's

  • because tf.data was actually preparing the upcoming batch

  • while the previous training step was happening.

  • The second is that before, it actually took

  • longer than the time of a batch in order

  • to prepare the next batch, whereas now we

  • see no large stalls in the timeline.

  • The reason is that tf.data, because

  • of the native parallelism, can now produce batches

  • much more quickly than the training can consume them.

  • And you can see this manifest in our throughput with more than a

  • 2x improvement.

  • But sometimes there are even more stalls.

  • So if we zoom in on the timeline of one of our training steps,

  • we'll see that there are a number of these very

  • small gaps, and these are launch overheads.

  • So if we further look at different portions of the GPU

  • stream, near the end, this is what

  • a healthy profile looks like.

  • You have one op runs and then finishes,

  • and immediately the next op can begin,

  • and this results in an efficient and saturated accelerator.

  • On the other hand, in some of the earlier parts of the model,

  • an op is scheduled on the GPU.

  • The GPU immediately chews through it,

  • and then it simply waits idle for Python

  • to enqueue the next op.

  • And this leads to very poor accelerator utilization

  • and efficiency.

  • So what can we do?

  • Well, if we go back to our training step,

  • we can simply add the tf.function decorator.

  • What this will do is this will trace the entire training

  • step into a tf.Graph, which can then

  • be run very efficiently from the tf runtime.

  • And this is the only change needed.
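
Concretely, the only change to the earlier training-step sketch is the decorator:

    @tf.function
    def train_step(features, labels):
        with tf.GradientTape() as tape:
            logits = model(features, training=True)
            loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
                labels=labels, logits=logits))
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss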

  • And now if we look at our timeline,

  • we see that all of the scheduling

  • is done in this single, much faster op on the CPU,

  • and then the work appears on the GPU as before.

  • And this also allows us to use the rich library

  • of graph optimizations that were developed in TensorFlow 1.

  • And you can see this is again almost

  • another factor-of-two improvement,

  • and it's pretty obvious in the timeline on the right why.

  • Whereas before tf.function when you were running everything

  • eagerly we had all of these little gaps waiting for ops,

  • now once it's compiled into a graph and launched from the C++

  • runtime, it's able to do a much better job keeping up with

  • the GPU.

  • So the next optimization that I'm going to talk about

  • is XLA, which stands for Accelerated Linear Algebra.

  • To understand how XLA works, we have just this simple example

  • of a graph with some skip connections.

  • What XLA will do is it will cluster that graph

  • into subgraphs and then compile the entire subgraph into a very

  • efficient fused kernel.

  • And there are a number of performance gains

  • from using these fused kernels.

  • So the first is that you get much more efficient memory

  • access because the kernel can use data

  • while it's hot in the cache as opposed to pushing

  • it all the way down the memory hierarchy

  • and then bringing it all the way back.

  • It also reduces the overhead from launch overhead--

  • this is the C++ executor launch overhead--

  • by running fewer ops but the same math.

  • And finally, XLA is heavily optimized to target hardware.

  • So it does things like use efficient hardware-specific

  • vector instructions, specialize on shapes,

  • and choose layouts so that the hardware is able to have

  • very efficient access patterns.

  • And it's quite straightforward to enable.

  • You simply use this tf.config.optimizer.set_jit

  • flag.

  • And this will cause every tf.function

  • that runs to attempt to compile and run with XLA.
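
For example (newer releases also offer per-function compilation via `tf.function(jit_compile=True)`):

    tf.config.optimizer.set_jit(True)   # ask the runtime to compile tf.functions with XLA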

  • And you can see that, in our example,

  • it's a very stark improvement.

  • It's about a 2.5x improvement in throughput.

  • The one caveat with XLA is that a lot

  • of the optimizations that it uses

  • are based on specializing on shapes.

  • So XLA needs to recompile every time it sees new shapes.

  • So if you have an extremely dynamic model-- for instance,

  • the shapes are different each batch--

  • you might actually wind up spending more time on the XLA

  • compile than you gain back from the more efficient computation.

  • So you should definitely try out XLA,

  • but if you see performance regression rather than

  • performance gain, it's likely that that's the reason.

  • So next we're going to talk about mixed precision.

  • If we want to go even faster, then we

  • can give up some of our numeric stability

  • in order to obtain faster training performance.

  • So here I have the IEEE float32, which

  • is the default numeric representation in TensorFlow,

  • but there are also two half-precision formats

  • that are relevant.

  • The first is bfloat16 where we keep all of the exponent

  • and simply chop off 16 bits of mantissa.

  • And the other is float16, where we give up some exponent

  • in exchange for keeping a little bit more mantissa.

  • And what's important about these formats

  • is that they actually have native hardware support.

  • So TPU has hardware support for very efficient

  • bfloat16 operations, and GPUs have

  • support for very efficient float16 operations.

  • So if we can formulate our computation

  • in these reduced-precision formats,

  • we can potentially get very high speed-ups from the hardware.

  • In order to enable this, we need to do a couple of things.

  • First, we need to choose a loss_scale.

  • So what is a loss_scale?

  • A loss_scale is a constant that's

  • inserted into the computation which doesn't change

  • the mathematics of the computation,

  • but it does change the numerics.

  • And so this is a knob where the runtime can

  • adjust the computation to keep it

  • in a numerically stable range.

  • The next thing we'll want to do is

  • we want to set a mixed_precision policy.

  • And this will tell Keras that it should cast tensors

  • as they flow through the model in order

  • to make sure that the computation is actually

  • happening in the correct floating-point representation.

  • In a custom training loop, we'll want

  • to wrap our optimizer in this loss_scale optimizer,

  • and this is the hook by which the loss_scale is inserted.

  • So as training happens, TensorFlow

  • will do a dynamic adjustment of this loss_scale

  • to balance vanishing gradients from FP16 underflow

  • while still preventing NaNs from FP16 overflow.

  • If you're using the Model.fit workflow,

  • this will be done for you.
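
A sketch of those pieces for a custom training loop; the spelling below matches the `tf.keras.mixed_precision.experimental` API of the TF 2.1-era talk (later releases drop the experimental prefix and handle the loss scale automatically):

    policy = tf.keras.mixed_precision.experimental.Policy("mixed_float16")
    tf.keras.mixed_precision.experimental.set_policy(policy)   # Keras casts tensors per layer

    optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
    # The loss-scale optimizer inserts and dynamically adjusts the loss scale.
    optimizer = tf.keras.mixed_precision.experimental.LossScaleOptimizer(
        optimizer, loss_scale="dynamic")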

  • And finally, we generally need to increase our batch size

  • when doing this mixed-precision training.

  • The reason is that mixed precision

  • makes computing a single example much less expensive.

  • So if we just turn on mixed precision,

  • we can go from a saturated accelerator in float32

  • to an underutilized accelerator in float16.

  • So by increasing the batch size, we

  • can go back to filling all of the hardware registers.

  • And we can see this in our example.

  • If we just turn on float16, there's actually

  • no improvement in performance.

  • But if we then increase our batch size,

  • then we see a very substantial improvement in performance.

  • It is worth noting that because we've both

  • reduced the numeric precision and changed the batch size,

  • this can require that you retune the hyperparameters.

  • And then finally if we look at what

  • are the remaining bits of performance, so about 60%

  • of what's left is actually the copy

  • from the host CPU to the GPU.

  • Now in a little bit Priya's going

  • to talk about distribution, and one

  • of the things that the distribution-aware code

  • in TensorFlow will do is it will automatically

  • pipeline this prefetch.

  • So you'll actually get that 60% for free.

  • And then finally you have hand tuning.

  • So you can give up some of the numeric stability.

  • You can mess with thread pools.

  • You can manually try and optimize layouts.

  • This is included largely for completeness in case

  • you're curious what real low-level hand tuning

  • looks like.

  • But for most cases, given that we

  • can get the vast majority of the performance with just simple,

  • idiomatic, easy-to-use APIs, this very fine hand tuning

  • is not recommended.

  • But once we've actually saturated a single device,

  • we need to do something else to get

  • another order-of-magnitude improvement.

  • And so Priya's going to talk to you about that.

  • PRIYA GUPTA: Thanks, Taylor.

  • All right, so Taylor talked about a number

  • of different things you can do to get the maximum performance

  • out of a single machine with a single GPU.

  • But if you want your training to go even faster,

  • you need to start scaling out.

  • So maybe you want to add more GPUs to your machine,

  • or perhaps you want to train on multiple machines in a cluster,

  • or maybe you want to use specialized

  • hardware such as cloud TPUs.

  • TensorFlow provides distribution strategy API

  • to allow you to do just that.

  • We built this API with three key goals in mind.

  • The first goal is ease of use.

  • We want users to be able to use the distribute.Strategy API

  • with very little changes to their code.

  • The second goal is to give great performance out of the box.

  • We don't want users to have to change their training code

  • or tune a lot of knobs to get the maximum efficiency out

  • of their hardware resources.

  • And finally, we want this API to work

  • in a variety of different situations.

  • So for instance, if you want to scale out

  • to multiple GPUs or multiple machines or TPUs,

  • or if you want just different distributed-training

  • architectures such as synchronous or asynchronous

  • training, or you're using different types of APIs--

  • so maybe you're using high-level Keras APIs like Model.fit,

  • or you have a custom training loop as in our example earlier.

  • We want distribution strategy to work in all

  • of these potential cases.

  • There are a number of different ways

  • in which you can use this API, and we've listed them here

  • in the order of increasing complexity.

  • The first one is if you're using Keras high-level API

  • model.fit for your training loop and you

  • want to distribute your training in that setup.

  • The second use case is what we've

  • been talking about in this talk so far where you have a custom

  • training loop and you want to scale out your training.

  • The third use case is maybe you don't have a specific training

  • program, but maybe you're writing a custom

  • layer or a custom library and you want

  • to make it distribution aware.

  • And finally, maybe you're experimenting

  • with a new distributed-training architecture

  • and you want to create a new strategy.

  • In the talk here, I'm only going to talk about the first two

  • cases.

  • So let's begin.

  • The first use case we'll talk about

  • is if you have a single machine with multiple GPUs

  • and you want to scale up your training to this situation.

  • For this setup, we provide MirroredStrategy.

  • MirroredStrategy implements synchronous training

  • across multiple GPUs on a single machine.

  • The entire computation of your model

  • would be replicated on each GPU.

  • All the variables of your model would be replicated on each GPU

  • as well, and they will be kept in sync using all-reduce.

  • Let me go through step by step to talk

  • about what the synchronous training looks like.

  • There are some gray boxes around these which are not

  • really visible, but let's say in our example

  • we have two devices or two GPUs, Device 0 and 1,

  • and we have a very simple model with two simple layers, Layer

  • A and B. Each layer has a single variable.

  • As you can see, the variables are mirrored or replicated

  • on these two devices.

  • So in our forward pass, we'll give

  • a different slice of our input data to each of these devices.

  • And in the forward pass, they will

  • compute the logits using the local copy of the variables

  • on these devices.

  • In the backward pass, each device

  • will then compute the gradients, again using the local copy.

  • Once each device has computed the gradients,

  • they'll communicate with each other

  • to aggregate these gradients.

  • And this is where all-reduce that I

  • mentioned before comes in.

  • All-reduce is a special class of algorithms

  • that can be used to efficiently aggregate tensors

  • such as gradients across different devices,

  • and it can reduce the overhead of such synchronization

  • by quite a bit.

  • There are a number of different such algorithms

  • available, and some hardware vendors such as Nvidia

  • also provide specialized implementations of all-reduce

  • for their hardware, such as NCCL.

  • So once these gradients have been aggregated,

  • the aggregated result would be available on each device,

  • and each device can update its local copy of the variable

  • using these aggregated tensors.

  • So in this way, both the devices are kept in sync,

  • and the next forward pass doesn't

  • begin until all the variables have been updated.

  • So now let's take a look at some code

  • to see how you can use MirroredStrategy to scale up

  • your training.

  • As I mentioned, we'll talk about two types of use cases.

  • The first one is if you're using the Keras high-level API,

  • and then we'll come back to the custom-training-loop example

  • after this.

  • So the code here is some skeleton code

  • to train the same MobileNetV2 model but this time using

  • the model.compile and fit API in Keras.

  • In order to change this code to use MirroredStrategy,

  • all you need to do is add these two lines of code.

  • The first thing you do is create a MirroredStrategy object,

  • and the second thing is you move the rest of your training

  • code inside the scope of this strategy.

  • Putting things inside the scope lets

  • it take control of things like variable creation.

  • So any variables that you create under the scope of the strategy

  • will now be mirrored variables.

  • You don't need to make any other changes

  • to your code in this case because we've already

  • made components of TensorFlow distribution aware.

  • So for instance, in this case we have

  • the optimizer which is distribution aware

  • as well as compile and fit.
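
A sketch of those two lines in context, reusing the hypothetical model head and `train_ds` pipeline from earlier:

    strategy = tf.distribute.MirroredStrategy()

    with strategy.scope():
        # Variables created under the scope become mirrored variables.
        backbone = tf.keras.applications.MobileNetV2(
            weights=None, include_top=False, pooling="avg")
        model = tf.keras.Sequential([backbone, tf.keras.layers.Dense(1)])
        model.compile(
            optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
            loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))

    model.fit(train_ds, epochs=10)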

  • The case we just saw was the simplest way

  • in which you can create a MirroredStrategy,

  • but you can also customize it.

  • So let's say by default it will use

  • all the GPUs available on your machine for training,

  • but if you want to only use specific ones,

  • you can specify them using the devices argument.

  • You can also customize what all-reduce algorithm

  • you want to use by using the cross_device_ops argument.
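
For example (the device names and the choice of NCCL here are illustrative):

    strategy = tf.distribute.MirroredStrategy(
        devices=["/gpu:0", "/gpu:1"],                    # use only these GPUs
        cross_device_ops=tf.distribute.NcclAllReduce())  # pick the all-reduce implementation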

  • So now we've seen how to use MirroredStrategy

  • when using the high-level Keras model.fit API.

  • Now let's go back to our custom-training-loop example

  • from before and see how you can modify

  • that to use MirroredStrategy.

  • And there's a little bit more code in this example

  • because when you have a custom training loop,

  • you have more control over what exactly you want to distribute,

  • and so you need to do a little bit more work

  • to distribute it as well.

  • So here, this is the skeleton of the custom training loop

  • from before.

  • We have the model, the loss function, the optimizer,

  • and you have your training step.

  • And then you have your outer loop

  • which iterates over your data and calls the training step.

  • The first thing you need to do is the same as before.

  • You create the MirroredStrategy object,

  • and you move the creation of the model, the optimizer, and so

  • on inside the scope of this strategy.

  • And the purpose of this is the same as before.

  • But as I mentioned, you need to do a few more things,

  • so this is not sufficient in this case.

  • And let's go over each of them one by one.

  • The first thing you need to do is

  • to distribute your data typically.

  • And if you're using tf.data data sets as your input,

  • this is straightforward.

  • All you need to do is call

  • strategy.experimental_distribute_dataset on your data set,

  • and this returns a distributed data

  • set which you can then iterate over

  • in a very similar manner as before.

  • The second thing you need to do is

  • to scale your loss by the global_batch_size,

  • and this is very important so that

  • the convergent characteristics of your model do not change.

  • And we've provided a helper method in the nn library

  • to do so.

  • And the third thing you need to do

  • is to specify which specific computation

  • you want to replicate.

  • So in this case, we want to replicate our training

  • step on each replica.

  • So you can use strategy.experimental_run_v2

  • API to provide the training-step function as well

  • as your distributed input, and you wrap this whole thing

  • in another tf.function because we want all of this to run

  • as a tf.Graph as well.

  • So those are all the code changes

  • you need to make in order to take the custom training

  • loop from before and now run it in a distributed fashion using

  • distribution strategy API.
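
A condensed sketch of those changes; it uses `strategy.run`, the current name for the `experimental_run_v2` call mentioned in the talk, and assumes `dataset` is batched with the global batch size:

    strategy = tf.distribute.MirroredStrategy()
    GLOBAL_BATCH_SIZE = 32 * strategy.num_replicas_in_sync

    with strategy.scope():
        backbone = tf.keras.applications.MobileNetV2(
            weights=None, include_top=False, pooling="avg")
        model = tf.keras.Sequential([backbone, tf.keras.layers.Dense(1)])
        optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

    dist_dataset = strategy.experimental_distribute_dataset(dataset)

    def train_step(features, labels):
        with tf.GradientTape() as tape:
            logits = model(features, training=True)
            per_example_loss = tf.nn.sigmoid_cross_entropy_with_logits(
                labels=labels, logits=logits)
            # Scale by the global batch size so convergence is unchanged.
            loss = tf.nn.compute_average_loss(
                per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss

    @tf.function
    def distributed_train_step(features, labels):
        per_replica_losses = strategy.run(train_step, args=(features, labels))
        return strategy.reduce(
            tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None)

    for features, labels in dist_dataset:
        distributed_train_step(features, labels)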

  • So let's go back and take a look at what kind of scaling

  • you can see.

  • So just out of the box adding this MirroredStrategy

  • to the MobileNet V2 example from before,

  • we're able to get 80% scaling when going from one

  • GPU to eight GPUs on a single machine.

  • It's possible to get even more scaling out of this

  • by doing some manual optimizations which

  • we won't be going into today.

  • To give another example of the scaling,

  • we ran multi-GPU training for ResNet50,

  • which is a very popular image-classification benchmark.

  • And in this example, we also used the FP16 and XLA

  • techniques that Taylor talked about before.

  • And here you can see going from one

  • to eight GPUs we were able to get 85% scaling.

  • And this example was using the Keras model.fit API instead

  • of the custom-training example.

  • If you're interested, you can look at the link in the bottom

  • and you can try out this model yourself.

  • Another example is using the transformer language model

  • to show that this is not just for images.

  • We're able to scale up other types of models as well.

  • And in this case, we're able to get more than 90%

  • scaling when running from one to eight GPUs.

  • So, so far we've been talking about scaling

  • up to multiple GPUs on a single machine,

  • but most likely you would want to scale even further

  • to multiple machines, maybe even multiple GPUs,

  • or perhaps just with CPUs.

  • For these use cases, we provide the

  • MultiWorkerMirroredStrategy.

  • As the name suggests, this is very

  • similar to the MirroredStrategy that we've been talking about.

  • It also implements synchronous training

  • but this time across all the different machines

  • in your cluster.

  • In order to do the all-reduce, it

  • uses a new type of op in TensorFlow

  • called collective ops.

  • A collective op is a single op in the TensorFlow graph

  • which can determine the best all-reduce to use based

  • on a number of different factors such as the network

  • topology, the type of communication

  • available between the different machines,

  • as well as tensor sizes.

  • It can also implement optimizations

  • such as tensor fusion.

  • So for instance if you have a lot of small tensors

  • that you want to aggregate, it may batch them up

  • into a single tensor in order to reduce the load on the network.

  • So how can you use MultiWorkerMirroredStrategy?

  • It's actually very similar to MirroredStrategy.

  • You create a strategy object like so,

  • and the rest of your training code

  • actually remains the same, so I've omitted it here.

  • Once you have the strategy object,

  • you can put the rest of your code in strategy.scope,

  • and you're good to go.

  • One more thing you need to do, however,

  • in the multi-worker case is to give us

  • information about your cluster.

  • And one way to do that for multi-worker strategy

  • is using the TF_CONFIG environment variable.

  • TF_CONFIG, you might be familiar with this environment variable

  • if you've used distributed training using Estimator

  • in TensorFlow 1.

  • So it basically consists of two components.

  • The first is the cluster which gives us information

  • about your entire cluster.

  • So here we have three different workers

  • at these hosts and port.

  • And the second piece is information

  • about the specific worker.

  • So this is saying this is worker one.

  • Once you provide this TF_CONFIG, the task part

  • would be different on each worker,

  • but their training code can remain the same.

  • And distribution strategy will read this TF_CONFIG environment

  • variable and figure out how to communicate

  • to the different workers in your cluster.
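
A hypothetical TF_CONFIG for a three-worker cluster might be set like this before the strategy is created; the host names and ports are placeholders:

    import json
    import os

    os.environ["TF_CONFIG"] = json.dumps({
        "cluster": {
            "worker": ["host1:12345", "host2:12345", "host3:12345"]
        },
        "task": {"type": "worker", "index": 1}   # this machine is worker 1
    })

    # Create the strategy only after TF_CONFIG is set.
    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

    with strategy.scope():
        ...  # build, compile, and fit the model as before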

  • So we've been talking about multiple machines,

  • multiple GPUs.

  • What about TPUs?

  • You've probably heard about TPUs in this conference somewhere

  • else before.

  • TPUs are custom hardware built by Google

  • to accelerate machine-learning workloads.

  • You can use them through cloud TPUs,

  • or you can even try them out in Codelab.

  • Distribution strategy provides TPUStrategy for you

  • to be able to scale up your training to TPUs.

  • It's very similar to MirroredStrategy

  • in that it implements synchronous training,

  • and it uses the cross_replica_sum API

  • to do the all-reduce across the TPU cores.

  • You can use this API to scale your training

  • to a single TPU or a slice of a pod or a full pod as well.

  • And if you want to try out TPUStrategy,

  • you can try it with TensorFlow nightlies,

  • or you can wait for the 2.1 stable release for this

  • as well.

  • Let's look at the code to see how you can use TPUStrategy.

  • The first thing you need to do is to provide information

  • about the TPU cluster.

  • So the way you do that is create a TPUClusterResolver

  • and give it the name or address of your TPU.

  • Once you have a ClusterResolver, you

  • can use the experimental_connect_to_cluster

  • API to connect to this cluster and also initialize

  • the TPU system.

  • Once you've done these three things,

  • you can then create the strategy object

  • and pass the ClusterResolver object to it.

  • Once you have the strategy object, the rest of the code

  • remains the same.

  • You create the strategy to scope,

  • and you put your training code inside that.

  • So I won't be going into that here.
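
A sketch of those steps; the TPU name is a placeholder, and in TF 2.1 the strategy lives under `tf.distribute.experimental` (later releases promote it to `tf.distribute.TPUStrategy`):

    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="my-tpu")
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)

    strategy = tf.distribute.experimental.TPUStrategy(resolver)

    with strategy.scope():
        ...  # build the model and run training as before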

  • So looking at this code, you can see that distribution strategy

  • makes it really easy to switch from training

  • onto multiple GPUs or multiple machines

  • to specialized hardware like TPUs.

  • So that's all I was going to talk

  • about distribution strategy.

  • To recap today, we talked about a few different things.

  • We talked about tf.data to build simple and performant input

  • pipelines.

  • We talked about how you can use tf.function

  • to run your training as a TensorFlow graph.

  • We talked about how you can use XLA and mixed precision

  • to improve your training speed even further.

  • And finally, we talked about the distribution strategy API

  • to scale out your training to more hardware.

  • We have a number of resources on our website, a number of guides

  • and tutorials, so we encourage you to check those out.

  • And if you have any questions or you want to report bugs,

  • please reach out to us on GitHub.

  • We'll also be available in the expo hall after the talk

  • if you have more questions.

  • And finally, the notebook that we've been using in the talk

  • is listed here.

  • So if you want to take a picture,

  • you can check it out later.

  • That's all.

  • Thank you so much for listening.

  • [APPLAUSE]

  • I guess we have some time, so we can take some questions.

  • AUDIENCE: Hi. Thanks for the nice presentation.

  • So I have two questions, both regarding inference.

  • So you mentioned how we can use TensorBoard

  • for profiling to see how well our CPUs and GPUs are utilized.

  • Can we do the same to see how well our inferences are

  • working?

  • TAYLOR ROBIE: Yes.

  • So the profiler has no concept of training or inference.

  • It just looks at ops as they execute.

  • So if you're just running the forward pass,

  • they'll show up all the same.

  • PRIYA GUPTA: It also kind of depends on

  • how you're doing inference.

  • Are you doing it using Python APIs or are you--

  • AUDIENCE: SavedModel.

  • So let's say I have a SavedModel format,

  • and I want to use that in TensorFlow.

  • PRIYA GUPTA: So you're not really

  • using any of the Python APIs, right?

  • AUDIENCE: So I use Python APIs, but basically to--

  • PRIYA GUPTA: So you're doing a SavedModel in Python,

  • and then-- yeah, then you can use TensorFlow as well.

  • AUDIENCE: Regarding mixed precision,

  • will it help in inference as well, and how do we enable that?

  • TAYLOR ROBIE: It can potentially.

  • Typically for inference, the integer low-precision formats

  • tend to be more important than the floating-point ones.

  • If you really care about very fast

  • inference with low precision, I would

  • suggest that you check out TF Lite because they have

  • a lot of quantization support.

  • AUDIENCE: Thank you.

  • AUDIENCE: Hello.

  • I had just one question about how

  • to save your model when you are training

  • in a DistributeStrategy because your code is replicated

  • in all the nodes.

  • What's the official way to do this?

  • PRIYA GUPTA: Yes, so you want to save a full SavedModel.

  • AUDIENCE: Yes.

  • PRIYA GUPTA: Yeah, so you can use the standard APIs,

  • just the tf.saved_model.save.

  • And what it will do is it will not save the replicated model.

  • It'll save a single copy of the model.

  • So then it's like any other SavedModel,

  • and you can then reload it.

  • If you load it inside another strategy again,

  • then you can then--

  • it will get replicated at that point,

  • or you can use it for inference as another SavedModel.

  • AUDIENCE: Because what I experienced

  • was that all the nodes tried to save

  • the model at the same time, and then it

  • was an issue between a race condition with the placement

  • where the models were saved.

  • PRIYA GUPTA: And so you got an error?

  • AUDIENCE: Yes, it crashed.

  • PRIYA GUPTA: Maybe we can talk offline.

  • It's supposed to work, but there could be bugs.

  • AUDIENCE: Thanks for the talk.

  • I have a quick question about mixed position and the XLA.

  • Will the API automatically use the hardware-related features?

  • Like for example, in mixed precision

  • if you have Volta machine, it will

  • be able to use the Volta machine's [INAUDIBLE]?

  • TAYLOR ROBIE: Yes.

  • For instance, on Volta, it will automatically

  • try to use the Tensor Core.

  • And in fact XLA and mixed precision

  • synergize very well because XLA-- or sorry, mixed precision

  • inserts a bunch of casts, and then XLA will go along

  • and it will actually optimize a lot of those away.

  • It will try and make the layout more amenable

  • to mixed precision.

  • So XLA tends to talk to the hardware at a very low level.

  • AUDIENCE: So basically from user point of view it's transparent.

  • If I run the same code on the CPU, on the old GPU, new GPU,

  • it will automatically try to use the hardware ability?

  • TAYLOR ROBIE: Yes.

  • SPEAKER: I'm really sorry.

  • We kind of have to stop this now for the livestreaming.

  • PRIYA GUPTA: We can take more questions outside the hall.
