
  • JIRI SIMSA: Hi, everyone.

  • My name is Jiri.

  • I'm a software engineer on the TensorFlow team.

  • And today, I'm going to be talking to you about tf.data

  • and tf.distribute, which are TensorFlow's APIs for input

  • pipeline and distribution strategy, respectively.

  • To set the stage for what I'm going to be talking about,

  • let's think about what are the basic building

  • blocks for a machine learning workflow?

  • Machine learning operates over data.

  • It runs some computation.

  • And it uses some sort of hardware to do this task.

  • This hardware can either be a single CPU on your laptop.

  • Or possibly it can be your workstation that

  • has either one or multiple accelerators,

  • either GPUs or TPUs, attached to it.

  • But you can also run the computation

  • across a large number of machines

  • that each have one or multiple accelerators attached to it.

  • Now, let's talk about how the machine learning building

  • blocks are being served or reflected in the APIs

  • that TensorFlow provides.

  • So for the data handling part of the machine learning task,

  • TensorFlow provides a tf.data API.

  • It's the input pipeline API for TensorFlow.

  • For the computation itself, such as supervised learning,

  • TensorFlow offers a number of different both high level

  • and low level APIs.

  • You might be familiar with Keras or Estimators--

  • they've been mentioned in earlier talks today--

  • as well as lower level APIs for building custom training loops.

  • And finally, to hide the hardware details

  • of your computation, TensorFlow provides a tf.distribute API,

  • which allows you to create your input pipeline

  • and model in a way that's agnostic to the environment

  • in which it's going to execute.

  • So you write your program as if it were

  • going to run, perhaps, on a single device,

  • and with minimal changes you're able to deploy it

  • on a large set of different devices and a variety

  • of different machine learning architectures.

  • In this talk, I'm going to talk about the tf.data input

  • pipeline API.

  • And then, in the second part, I'm

  • also going to talk about the tf.distribute, the distribution

  • strategy API.

  • I'm not going to talk about Keras, and Estimator,

  • and other APIs for the modeling itself,

  • as that has been covered in previous talks.

  • So without further ado, let's get

  • started with tf.data, which is TensorFlow's input pipeline API.

  • So let's ask ourselves a question.

  • Why do we need an input pipeline API in the first place?

  • Why don't we just load the data in memory,

  • maybe in our Python program as a NumPy array,

  • and pass it into a Keras model?

  • Well, there is actually a number of good reasons

  • why we need an API or why using one will benefit us.

  • First of all, the data might not fit into memory.

  • For example, the ImageNet data set

  • is 140 gigabytes of data, which does not necessarily

  • fit into memory on every laptop or workstation.

  • The data itself might also require randomized

  • preprocessing, which means that we cannot preprocess everything

  • ahead of time offline and then have the data ready

  • for training.

  • We actually need to have an input pipeline that

  • performs the preprocessing, such as, in the case of ImageNet,

  • perhaps image cropping or randomized image distortions

  • or transformations on the fly as we're

  • running the machine learning computation.

  • Having an input pipeline API as an abstraction might also

  • allow us to, in the runtime of this API,

  • implement things in a way that allows the computation

  • to efficiently utilize the underlying hardware.

  • And I'm actually going to spend a fair amount of the first part

  • of my talk talking about how to efficiently utilize

  • the hardware through the tf.data input pipeline abstraction.

  • Last, but not least, which is something that ties the tf.data

  • API to the tf.distribute API, using an input pipeline API

  • abstraction allows us to decouple

  • the task of loading and preprocessing of the data

  • from the task of distributing the computation.

  • We are using the abstraction, which

  • allows you to create your input pipeline assuming

  • it's going to run in one place.

  • And then the distribution strategy

  • will somehow distribute the data without you

  • having to worry about the fact that the input pipeline might

  • actually be evaluated in multiple places in parallel.

  • So for those reasons, we created tf.data, TensorFlow's input

  • pipeline API.

  • And the way I like to think about an input

  • pipeline created through tf.data

  • is as an ETL process.

  • What I mean by that is the E, T, and L

  • stand for different parts of the input pipeline stages.

  • E stands for Extract.

  • This is the stage in which we read the data,

  • either from memory or from local or remote storage.

  • And we possibly parse the file format

  • that the data is stored in.

  • Perhaps it's compressed.

  • Then the T, the Transform stage, in this stage,

  • we perform either domain specific or domain

  • agnostic transformations.

  • So the domain specific transformations

  • are specific to the type of data we're dealing with.

  • So, for instance, text vectorization,

  • image transformation, or temporal video sampling

  • are examples of domain specific transformations.

  • While domain agnostic transformations

  • include things like shuffling of your data

  • during training or batching.

  • That is combining multiple elements

  • into a single higher dimensional element.

  • And, finally, the last stage of the input pipeline, Loading,

  • pertains to efficiently transferring

  • the data onto the accelerator, which is either a GPU or TPU.

  • What I should point out here is that, traditionally, the input

  • pipeline portion of your machine learning computation

  • happens on a CPU.

  • Because some of the operations are naturally

  • only possible on the CPU, which leaves the GPU and TPU

  • resources available for your machine

  • learning specific computations, such as your models.

  • This puts an extra pressure

  • on the efficiency with which the input pipeline performs.

  • And the reason for that,

  • which is what I'm trying to illustrate here

  • with the graph,

  • is that over time the rate at which CPUs perform

  • has plateaued, while the computational power of GPUs

  • and TPUs, thanks to recent hardware advances,

  • continues to accelerate at an exponential rate, which

  • opens up this performance gap between a raw CPU

  • and GPU/TPU processing power available in a single machine.

  • And the consequence of this

  • could be that the CPU part of your machine learning

  • computation, namely the input pipeline,

  • can be a bottleneck of your computation.

  • So it's really important that the CPU input pipeline performs

  • as efficiently as it can.

  • So let's take a look at an example of what a tf.data-based

  • input pipeline actually looks like.

  • Here, I'm using an example of how a common

  • image processing input pipeline would look.
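
The slide itself isn't captured in the transcript, so here is a minimal sketch of the kind of pipeline being described; the file names, the feature spec, and the parse_and_augment helper are illustrative assumptions, not the speaker's exact code.

```python
import tensorflow as tf

# Hypothetical parsing/augmentation function; the exact feature spec is an assumption.
def parse_and_augment(serialized_example):
    features = tf.io.parse_single_example(
        serialized_example,
        {"image": tf.io.FixedLenFeature([], tf.string),
         "label": tf.io.FixedLenFeature([], tf.int64)})
    image = tf.io.decode_jpeg(features["image"], channels=3)
    image = tf.image.random_crop(image, size=[224, 224, 3])   # randomized preprocessing
    image = tf.image.random_flip_left_right(image)
    return image, features["label"]

# Extract: read records from TFRecord files (file names are placeholders).
dataset = tf.data.TFRecordDataset(["train-00001.tfrecord", "train-00002.tfrecord"])
# Transform: apply the preprocessing function to each element.
dataset = dataset.map(parse_and_augment)
# Batch: combine multiple elements into a single higher-dimensional element.
dataset = dataset.batch(32)
```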

  • We're first creating a data set using the TFRecordDataset

  • operation.

  • It's a data set constructor that takes a set of file names

  • or a set of file patterns and produces

  • elements that are stored in those files

  • in a sequence-like manner.

  • And once you create a data set, you

  • can chain transformations onto the data set,

  • thus creating new types of data sets.

  • A very common one and very powerful

  • one is the map transformation, which

  • allows you to apply an arbitrary processing on the elements

  • of the data set.

  • And this preprocessing can be expressed

  • as a function that ends up being traced

  • using the mechanisms available in TensorFlow,

  • meaning this function that is being used to transform

  • elements of the data set is executed as a data flow

  • graph, which has important implications

  • for the performance and how the runtime can actually

  • execute this function.

  • And the last thing that I illustrate here

  • is the batch transformation, which

  • combines multiple elements of the input data set

  • and produces a single element as an output that

  • has a higher dimension, which is a common practice for training

  • efficiency.

  • Now one thing that's not illustrated here,

  • but it actually does happen under the hood inside

  • of the tf.data runtime, is that for certain combinations

  • of transformations, tf.data provides more efficient

  • fused implementations.

  • For instance, if a map transformation is followed

  • by a batch transformation, we actually have a highly

  • efficient C++ based implementation

  • for the combination of the two that can give you up to 2x

  • speed up in the performance of your input pipeline.

  • And that happens kind of magically behind the scenes.

  • And the important bit that I want to highlight here

  • is that the user doesn't need to worry about it.

  • The user here doesn't really need

  • to do anything with respect to optimizing the performance.

  • They focus on creating an input pipeline with the functional

  • preprocessing in mind.

  • And once you create the data set that you would like,

  • you can pass it into TensorFlow high level API such as Keras

  • or Estimator, which all support data set abstraction

  • as an input for the data.

  • So let's talk a bit more about the input pipeline performance.

  • If you were to implement the input pipeline

  • in naive fashion using CPU for the input pipeline processing

  • or data preparation and the GPU and TPU for the training

  • computation, you might end up in a situation

  • like is illustrated on the slide where

  • at any given point in time you're

  • only utilizing one of two resources available to you.

  • And you could probably tell that this seems rather inefficient.

  • Well, a common technique that can

  • be used to make this style of computation more efficient

  • is called software pipelining.

  • And the idea is that while you're

  • working on the current element for the training step

  • on a GPU or a TPU, you've already

  • started preprocessing data for the next training

  • step on the CPU.

  • And thus, you overlap the computation

  • that happens on the two devices or two

  • resources available to you.

  • Achieving the effect of software pipelining in tf.data

  • is pretty straightforward.

  • All you do is you chain a .prefetch transformation

  • to a particular point in your input pipeline.

  • And the effect of doing that will

  • be that the producer of the data up to that point

  • will be decoupled from the consumer of the data,

  • in this case, the Keras model.

  • And the two will be operating independently,

  • coordinating through an internal buffer.

  • And this will have the desired effect of software

  • pipelining that I illustrated in the previous slide.
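
As a sketch, reusing the hypothetical parse_and_augment and file names from the earlier example, the change is a single chained transformation at the end of the pipeline (the buffer size of 1 is just an illustration; choosing it is discussed later in the talk):

```python
dataset = tf.data.TFRecordDataset(["train-00001.tfrecord", "train-00002.tfrecord"])
dataset = dataset.map(parse_and_augment)
dataset = dataset.batch(32)
dataset = dataset.prefetch(1)  # decouple the producer (input pipeline) from the consumer (model)
```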

  • Another opportunity for improving

  • the performance of your input pipeline

  • is to parallelize the transformation.

  • So the top part of this diagram illustrates

  • that we're using sequential processing for applying the map

  • transformation of the individual elements of the batch

  • that we are then going to create.

  • But there is no reason that you need

  • to do that unless there is, in effect, some sort of data

  • or control dependency.

  • But commonly, there is not.

  • And in that case, you can parallelize

  • and overlap the preprocessing of all the individual elements

  • out of which we're going to create the batch.

  • So let's take a look at how we would

  • do that using the tf.data API.

  • And similar to the software pipelining idea,

  • this is pretty straightforward.

  • You simply add a single argument,

  • num_parallel_calls, to the map transformation,

  • which indicates to the tf.data runtime

  • that it should, in fact, preprocess

  • elements of the input data set in parallel.
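
Again as a sketch building on the earlier example; the value 4 is arbitrary, and picking it well is exactly what the autotuning discussion later addresses:

```python
dataset = tf.data.TFRecordDataset(["train-00001.tfrecord", "train-00002.tfrecord"])
dataset = dataset.map(parse_and_augment, num_parallel_calls=4)  # preprocess 4 elements in parallel
dataset = dataset.batch(32)
dataset = dataset.prefetch(1)
```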

  • Important bit here is that the user doesn't really

  • need to worry about the threading

  • or multiprocessing and use complicated Python APIs

  • or be aware of things like the global interpreter lock.

  • It just happens inside of the tf.data runtime,

  • which is implemented in C++.

  • And thus, it sidesteps the complexities

  • that the user would need to go through.

  • And a last best practice for optimizing

  • the performance of an input pipeline

  • is that of parallel extraction.

  • So similar to the parallel transformation,

  • where in that case the sequential mapping of the data

  • might have been the bottleneck, another potential source

  • of a bottleneck of your input pipeline

  • is the sequential nature with which data is being read.

  • If you're just reading elements one file at a time,

  • the I/O could actually be a bottleneck

  • of your input pipeline.

  • And the answer to that, well, you

  • don't have to do that sequentially.

  • You can do it in parallel.

  • And to do that using the tf.data API, well,

  • this time it's not a one line change.

  • It's a two line change.

  • And so what changes is that we're

  • going to replace the TFRecordDataset

  • source with two lines.

  • The first line uses a list_files transformation,

  • which creates a data set that is going to contain all the file

  • names that the particular pattern we specify

  • evaluates to.

  • And then we're going to apply the interleave transformation

  • to this data set, which takes a user defined function,

  • which is a data set factory operating on the inputs--

  • in this case, file names--

  • and producing data sets-- in this case,

  • TF record data sets for that particular file name.

  • And specifying the num_parallel_calls

  • argument determines how many files we should

  • be reading in parallel at any given point in time.
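
A sketch of the two-line change being described, again reusing the hypothetical parse_and_augment helper; the file pattern and the parallelism values are illustrative:

```python
files = tf.data.Dataset.list_files("/path/to/train-*.tfrecord")  # hypothetical file pattern
dataset = files.interleave(
    tf.data.TFRecordDataset,   # dataset factory: file name -> dataset of records in that file
    cycle_length=4,            # how many files are open at once (illustrative value)
    num_parallel_calls=4)      # how many files are read in parallel (illustrative value)
dataset = dataset.map(parse_and_augment, num_parallel_calls=4)
dataset = dataset.batch(32)
dataset = dataset.prefetch(1)
```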

  • Now, I kind of cheated up to this point in my presentation.

  • Because I said, well, the user doesn't really

  • have to worry about performance and the aspects

  • of their environment.

  • And it turns out that in order to choose

  • optimal values for these num_parallel_calls arguments

  • or the buffer size for prefetch, you actually

  • have to understand your environment.

  • At least that's how it used to be, historically, when

  • this API was first introduced.

  • And over the past year or so, we actually

  • worked on lifting this restriction

  • and making the performance of tf.data great out of the box.

  • And the way this is achieved is that

  • instead of specifying manually what the right buffer

  • size or the right number of parallel calls

  • should be for these different transformations,

  • you can actually specify this special constant called

  • tf.data.experimental.AUTOTUNE.

  • And if you do that, this will indicate to the tf.data runtime

  • that you want to delegate the task of choosing

  • the optimal level of parallelism or buffer size

  • to the tf.data runtime.

  • And it will do that on your behalf.
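
A sketch of the same pipeline with the tuning knobs delegated to the runtime via the AUTOTUNE constant:

```python
AUTOTUNE = tf.data.experimental.AUTOTUNE

files = tf.data.Dataset.list_files("/path/to/train-*.tfrecord")
dataset = files.interleave(tf.data.TFRecordDataset, num_parallel_calls=AUTOTUNE)
dataset = dataset.map(parse_and_augment, num_parallel_calls=AUTOTUNE)
dataset = dataset.batch(32)
dataset = dataset.prefetch(AUTOTUNE)  # buffer size is also chosen by the runtime
```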

  • I should mention that autotuning, at this point,

  • is enabled by default. But you still

  • have to specify the constant if you actually

  • want to indicate which of these knobs should be autotuned.

  • You can also disable autotuning if you would like

  • to try to do this manually.

  • And the mechanism for disabling autotuning is tf.data.Options.

  • The tf.data.Options is an object that

  • specifies global options that should be used for your input

  • pipeline.

  • And besides controlling autotuning,

  • it can also be used to control things

  • like static optimizations that are not

  • enabled by default, because they are not always

  • a win, such as map vectorization or map parallelization,

  • or, for instance, specifying whether your input pipeline is

  • allowed to produce elements out of order; by default,

  • your input pipeline is deterministic.

  • The options object also allows you to, for example,

  • collect statistics about data in your input pipeline.

  • And for the performance experts in the audience,

  • it also allows you to fine tune threading parameters of tf.data

  • internals.

  • And the way you would use tf.data.Options

  • is that once you create your data set,

  • you also create an instance of the options object

  • and set whatever options that you're interested in.

  • In this example, I'm turning the map_parallelization

  • optimization on.

  • And then, importantly, you associate

  • the options object with the data set

  • using the with_options transformation, which,

  • similar to all the other transformations that I talked

  • about up to this point, returns back a new data set that

  • now has the options applied.
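
A minimal sketch of the pattern just described, continuing the earlier pipeline and using option names as they existed around TF 2.0 (some have since been renamed in later releases):

```python
options = tf.data.Options()
options.experimental_optimization.map_parallelization = True  # opt in to a static optimization
options.experimental_deterministic = False                    # allow out-of-order elements

dataset = dataset.with_options(options)  # returns a new dataset with the options applied
```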

  • Last thing pertaining to tf.data that I would like to talk about

  • is the TensorFlow Datasets project.

  • So up to this point, I've been talking about just the core

  • tf.data API, which our users can use to create input pipelines

  • starting from raw data.

  • However, for a lot of common existing data sets,

  • this is a repetitive task.

  • And especially machine learning beginners or novice users

  • do not necessarily want to do that to get

  • started with machine learning.

  • And to make it easier to onboard new users,

  • as well as make it easier to use existing data sets,

  • the TensorFlow Datasets project provides canned data

  • sets that are ready to be used with the rest of TensorFlow.

  • The way you use the TensorFlow Datasets

  • project is that once you import it as a module,

  • you can, for example, list the set of available data sets

  • through the list_builders call.

  • And I think, at this point, there is something like 60

  • plus different data sets spanning text, image, audio,

  • and video, that are supported through the TensorFlow data

  • sets project.

  • Then, through the load command, using the name

  • as the identifier of the data set

  • you would like to load and optionally

  • the split argument, which you can use to identify whether you

  • want the training or the test portion of the data

  • set, you get back an instance of a tf.data data set

  • that can be immediately used with your model.

  • Or you could, because it's a tf.data data set instance,

  • you can optionally apply some custom transformations to it,

  • such as, in this case, shuffling and batching.

  • Or, if you would like to just inspect

  • what's inside of the data set, you

  • could do so using a simple Python-like iteration where

  • you can print the elements of the data set.
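
A sketch of that flow; "mnist" is just one of the available dataset names:

```python
import tensorflow_datasets as tfds

print(tfds.list_builders())                  # list the available datasets, e.g. ['mnist', ...]

dataset = tfds.load("mnist", split="train")  # returns a tf.data.Dataset
dataset = dataset.shuffle(1024).batch(32)    # optional custom transformations

for example in dataset.take(1):              # plain Python iteration to inspect elements
    print(example)
```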

  • So this concludes the first part of my talk

  • in which I talked about tf.data.

  • And in the second part of my talk,

  • we're going to talk about the distribution strategy API.

  • So similar to the first part of our talk, where we asked ourselves,

  • why do we need an input pipeline API?

  • Let's start by asking ourselves, why

  • do we need to do distributed training?

  • Why do we need distribution strategy API?

  • Well, it turns out that if we do training on one machine,

  • on one device, it can take a pretty long time.

  • This graph illustrates that by showing the accuracy achieved

  • by the ResNet model over time on the ImageNet

  • data set using a single GPU.

  • And you can see that it takes close to 90 hours

  • to get to accuracy around 75%, while the most performant

  • implementations of the same model, or deployments

  • of the same model actually take less than 10 minutes using

  • an amazing amount of resources parallelizing this computation.

  • What going down from 87 hours to 10 minutes enables

  • is that you can actually experiment

  • with ideas very quickly as opposed

  • to starting an experiment and waiting for one or two

  • days before you can do the next iteration.

  • And I think this is game changing.

  • So I hope I've convinced you that distributing your computation,

  • if it takes a very long time, is a very good idea.

  • So let's talk about how you do that with TensorFlow's

  • distribution strategy API.

  • There are three main goals that the distribution strategy

  • API has.

  • First of all, it should be easy to use.

  • What this means is that it should be possible for you

  • to create your input pipeline and your model assuming

  • that it's going to run on one device and then,

  • with minimal code changes, be able to deploy

  • to different architectures, either multiple GPUs

  • on your workstation or possibly even a cluster of workstations

  • that have either GPUs or TPUs attached to them.

  • It should also provide great out of box performance.

  • This means that the performance that you get out

  • of using distribution strategy should

  • be close to the performance you would get if you were manually

  • targeting a specific architecture

  • with your implementation.

  • And finally, it should be versatile.

  • So it should support different types of architectures,

  • different types of hardware, and different types of APIs

  • for your input pipeline or model.

  • The use cases for the distribution strategy API

  • can be roughly categorized as follows,

  • ranging from the simplest to perhaps the most advanced.

  • So the simplest one is you have a model that

  • uses either the Keras or Estimator API.

  • And you would like to distribute it.

  • And this is what we are going to cover in this talk.

  • The second one is you have a model that you

  • used lower level TensorFlow APIs to create a custom training

  • loop.

  • And you would like to distribute it.

  • And we're also are going to cover that in this talk.

  • Now, the last two, the more advanced ones,

  • namely making a layer, library, or infrastructure

  • distribution-aware-- so, for example, how

  • would you make something like Keras distribution aware?

  • Or how would you make a new strategy,

  • where strategy is something that is an abstraction that

  • hides or decouples the model and input

  • pipeline from the particular architecture?

  • Those two use cases we will not cover in this talk.

  • But they're covered by guides and tutorials

  • on the TensorFlow web site.

  • So in case that you would like to learn more,

  • I direct your attention to the TensorFlow website.

  • So let's start by talking about the use case

  • where you have a model that's created

  • either in Keras or Estimator.

  • And you would like to distribute it using the distribution

  • strategy.

  • And in this section, I'm also going

  • to introduce the distribution strategies that are actually

  • available in TensorFlow.

  • So the first strategy that's available, called

  • mirrored strategy, is one that allows you to distribute

  • your program across multiple GPUs attached

  • to a single worker.

  • And the particular implementation

  • of this strategy uses something called

  • all-reduce synchronous training, where the synchronous part

  • means that all of the devices will be performing steps

  • in a lock-step, so in a coordinated fashion.

  • While the all-reduce portion pertains

  • to how the different devices exchange information

  • about the local updates that they collect in each step.

  • To shed a little more light onto how the all-reduce algorithm

  • works, on this slide, I illustrate

  • what happens in the all-reduce algorithm

  • when you have three GPUs that each perform a single step that

  • updates a mirrored version of three variables.

  • So each of the boxes, the blue, the green, and the pink,

  • corresponds to a variable that, in a single step,

  • receives different updates on different devices.

  • And once the step is performed on all the devices,

  • we can propagate the updates in a circular fashion

  • between the different devices.

  • And at that point, all of the devices

  • will have all of the updates from all the devices for all

  • of the variables, requiring N minus one transfers for N

  • devices.

  • And then, once all the updates have been collected,

  • a reduce function can be used to combine

  • the updates to a single global value

  • where the common reduce functions are either

  • a sum or an average of those updates.
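
To make the circulate-then-reduce idea concrete, here is a small pure-Python simulation; it is a conceptual sketch only, not TensorFlow's actual all-reduce implementation (which uses NCCL or similar collectives and reduces while the data moves):

```python
# Three devices, each holding local updates for three variables (the blue, green, pink boxes).
local_updates = {
    "gpu:0": [1.0, 2.0, 3.0],
    "gpu:1": [4.0, 5.0, 6.0],
    "gpu:2": [7.0, 8.0, 9.0],
}

devices = list(local_updates)
n = len(devices)

# Each device starts out having "seen" only its own updates.
seen = {d: [list(local_updates[d])] for d in devices}

# N - 1 rounds: each device forwards what it received in the previous round to its ring neighbour.
for step in range(n - 1):
    for i, d in enumerate(devices):
        neighbour = devices[(i + 1) % n]
        seen[neighbour].append(seen[d][step])

# Reduce: every device combines the per-variable updates, here with a sum.
for d in devices:
    reduced = [sum(values) for values in zip(*seen[d])]
    print(d, reduced)   # every device ends up with the same global value: [12.0, 15.0, 18.0]
```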

  • And with that knowledge in mind, a single step

  • of a synchronous training can be illustrated

  • on this example, where let's assume we

  • have a model with two layers.

  • And each layer has two variables.

  • And the variables are mirrored on each device.

  • We have two devices.

  • Now, in the forward pass, data is

  • propagated through the layers.

  • And then in a backward pass, the gradients for the variables

  • are computed.

  • And at that point, the updates to the two variables

  • on the different devices might be different.

  • Because we actually use two different pieces

  • of data on each device.

  • And that's the point where we use the all-reduce algorithm

  • to share the updates on each device with each other,

  • and thus achieve a consistent global state across the two devices.

  • And this is what synchronous training refers to.

  • Now, let's take a look at how you would actually

  • go about using a mirrored strategy with the Keras

  • and the Estimator APIs.

  • So to create an instance of a mirrored strategy,

  • there are a couple of different factories, a default one or one

  • where you can explicitly name the devices

  • that you would like to create the mirrored strategy for.

  • I believe that the default is if you don't specify it,

  • it's going to be all the GPUs attached to your worker.

  • And you can also optionally specify arguments

  • for the all-reduce algorithm through the cross_device_ops

  • argument of the MirroredStrategy constructor.
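
A sketch of those constructor options; the device names and the choice of cross-device ops are illustrative:

```python
import tensorflow as tf

# Default: mirror across all GPUs visible on this worker.
strategy = tf.distribute.MirroredStrategy()

# Or explicitly name the devices to mirror across.
strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])

# Optionally choose the all-reduce implementation via cross_device_ops.
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())
```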

  • Now, how would we use mirrored strategy or any other strategy,

  • for that matter, with Keras API?

  • Well, here's a common or a simple example

  • of a Keras model for ResNet 50 with a stochastic gradient

  • descent optimizer.

  • We create the model.

  • We specify the optimizer.

  • And then we use the compile and fit APIs to perform training

  • over our training data set, which is an instance of tf.data

  • data set.

  • Now, this runs on a single machine,

  • possibly using a local GPU.
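
A sketch of the single-device version being described; the optimizer settings, number of epochs, and train_dataset are placeholders rather than the speaker's exact code:

```python
import tensorflow as tf

model = tf.keras.applications.ResNet50(weights=None)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)   # illustrative settings

model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer)
model.fit(train_dataset, epochs=10)   # train_dataset: a tf.data.Dataset, assumed defined elsewhere
```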

  • In case we have multiple GPUs, we

  • can simply define an instance of MirroredStrategy

  • and then make sure that all of the model creation

  • is wrapped inside of a strategy.scope.

  • And with these two lines, your program

  • will now be able to run on all the GPUs

  • available on the worker.
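
The same sketch with the two added lines, the strategy and the scope; compile and fit stay unchanged:

```python
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # All variable creation (model and optimizer) happens inside the scope,
    # so the variables are mirrored across the GPUs.
    model = tf.keras.applications.ResNet50(weights=None)
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)
    model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer)

model.fit(train_dataset, epochs=10)
```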

  • And the key here is that the strategy.scope

  • will take care of variable creation inside of your model,

  • making sure that all the variables are mirrored

  • on the different GPU devices.

  • And the body of the strategy.scope

  • will be distribution aware.

  • So recall that one of the goals for the distribution strategy

  • API was that it provides great out of box performance.

  • So on this slide, I would like to convince you

  • that it does, at least for mirrored strategy

  • on the ResNet 50.

  • So what we're looking at here is the performance

  • of a ResNet 50 based training using Keras,

  • running TensorFlow 2.0 on Google Cloud.

  • The vertical axis of the graph plots images per second.

  • And the horizontal axis ranges the number of GPUs from one

  • to two to eight.

  • And we can see that using mirrored strategy

  • achieves close to linear scaling,

  • starting with a single GPU achieving roughly 1,250

  • images per second to eight GPUs achieving close to 10,000

  • images per second.

  • Now, we've covered

  • the Keras API usage with distribution strategy.

  • Let's also cover the Estimator API usage.

  • So this is a common example of how you would use the Estimator

  • API for your training.

  • Namely, you define a classifier using the Estimator constructor

  • that you provision with a model function.

  • And then, through the train call,

  • you specify an input function, which can return, for instance,

  • a tf.data data set.

  • And it performs the training.

  • In order to parameterize the Estimator API with a strategy,

  • all you need to do is, again, to create

  • an instance of the strategy, in this case, MirroredStrategy,

  • and pass it in through the RunConfig option

  • into the Estimator API.
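
A sketch of wiring a strategy into the Estimator API as described; model_fn and input_fn stand in for your own model and input functions:

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
config = tf.estimator.RunConfig(train_distribute=strategy)

# model_fn and input_fn are placeholders for your own model and input functions.
classifier = tf.estimator.Estimator(model_fn=model_fn, config=config)
classifier.train(input_fn=input_fn)   # input_fn can return a tf.data.Dataset
```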

  • And once that happens, the RunConfig

  • with a strategy will make sure

  • that the model function is created once per replica.

  • And replica, in this case, refers to the GPU.

  • So you're going to have copies of the model on each GPU

  • as well as of all the variables inside of the model.

  • And you will perform the all-reduce synchronous training

  • across the multiple GPUs.

  • So distributing your computation across multiple GPUs

  • on a single machine can get you up to an N-times speedup,

  • where N is the number of accelerators attached

  • to your machine.

  • But there is a physical limit to how many

  • accelerators you can have.

  • And to go beyond that limit, the natural next thing

  • is to actually use multiple machines, each

  • with one or multiple accelerators.

  • And that's what the multi-worker mirrored strategy

  • is intended to help you with.

  • And it's very similar to the mirrored strategy.

  • The only difference is that instead of distributing

  • your computation over GPUs on a single machine,

  • it distributes the computation over GPUs on many machines.

  • And it performs the all-reduce computation

  • not just across GPUs on a single workstation,

  • but across the GPUs on all the different workstations.

  • And it does so through TensorFlow collective ops,

  • which allows you to actually send

  • data in a broadcast fashion between the different

  • TensorFlow workers.

  • The way you would use this API is

  • similar to the mirrored strategy API.

  • So you can create a default instance

  • of a MultiWorkerMirroredStrategy.

  • Or you can specify a specific CollectiveCommunication

  • algorithm to be used.

  • Unlike the mirrored strategy, you also

  • need to specify information about the different workers

  • that are participating inside of your computation.

  • And this is done through a JSON-encoded string that

  • identifies the hosts and ports of your different workers

  • as well as their task types.
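
A sketch of that setup, using the experimental names from the TF 2.0 era; the host addresses in TF_CONFIG are hypothetical:

```python
import json
import os
import tensorflow as tf

# Cluster description and this machine's task, via the TF_CONFIG environment variable.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["host1:12345", "host2:12345"]},
    "task": {"type": "worker", "index": 0},
})

# Default instance, or pick a specific collective communication implementation.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
    tf.distribute.experimental.CollectiveCommunication.NCCL)
```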

  • The third strategy that I'm going to talk about,

  • and that's available in the TensorFlow distribution

  • strategy API is the TPU strategy.

  • And this one is very similar to the mirrored strategy.

  • The main difference is that it allows

  • you to perform the all-reduce synchronous training on TPUs,

  • which are the hardware accelerators made by Google

  • specifically for TensorFlow.

  • But at this point, there are also

  • other frameworks that are capable of leveraging them.

  • And you can do so through the Google Cloud platform.

  • And unlike the mirrored strategy,

  • it uses the cross_replica_sum to perform the all-reduce

  • on TPUs, which is something that's

  • a difference between GPUs and TPUs.

  • And you can use this strategy for training

  • on a single TPU or an entire pod,

  • which is a term that refers to a set of TPU cores

  • in a topology.

  • To use a TPU strategy is a little more complicated.

  • And it's also somewhat of an area of active development,

  • which the experimental portions of the API refer to.

  • But the high level idea is that you create a TPU cluster

  • resolver, which allows you to gather information

  • about your TPU hardware.

  • And then you create the TPU strategy

  • with this cluster resolver argument,

  • which then allows the TPU strategy to be

  • aware of the TPU hardware location and specifics.
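
A sketch of the typical setup; the TPU name is a placeholder, and the exact function names in this area were experimental at the time and have shifted between releases:

```python
import tensorflow as tf

# "my-tpu" is a placeholder for the TPU name or address on Google Cloud.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="my-tpu")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)

strategy = tf.distribute.experimental.TPUStrategy(resolver)
```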

  • And so up to this point, I've been

  • talking about synchronous training, where

  • all the devices in your training loop

  • are performing or operating in a lock-step, one step at a time.

  • An alternative to synchronous training,

  • which might be suitable for certain types of machine

  • learning tasks, is so-called asynchronous training,

  • where the different devices or different workers

  • in your set of workers performing your computation

  • are actually running at different rates.

  • And one of the architectures that

  • enables asynchronous training is a so-called parameter server

  • and worker architecture where your machines

  • have one of two roles, parameter server tasks or worker tasks.

  • The parameter server tasks are where global variable state

  • is stored and either updated or fetched

  • from by the individual workers.

  • While the workers perform the machine learning

  • computation one step at a time, but not necessarily

  • at the same rate.

  • And this architecture can be targeted

  • for your machine learning program using the parameter

  • server strategy.

  • You create it using this factory.

  • And similar to the multi-worker strategy,

  • you need to specify information about the workers and the types

  • of tasks that the worker machines should play,

  • namely the worker task or the parameter server task.

  • And again, this is done so through the TF_CONFIG

  • environment variable.
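
A sketch of that setup with a hypothetical two-worker, one-parameter-server cluster, again using the TF 2.0-era experimental name:

```python
import json
import os
import tensorflow as tf

# TF_CONFIG describes the cluster and this machine's role; the addresses are hypothetical.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": ["host1:12345", "host2:12345"],
        "ps": ["host3:12345"],
    },
    "task": {"type": "worker", "index": 0},
})

strategy = tf.distribute.experimental.ParameterServerStrategy()
```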

  • And a last strategy that I want to talk about

  • and that's available through the distribution strategy API

  • is the central storage strategy.

  • This is a special case of the parameter server strategy

  • where there is a single parameter server.

  • And its role is being fulfilled by a CPU

  • of the machine on which the other devices reside.

  • And the benefit of this strategy is

  • that while any single GPU might not be able to fit

  • all the embeddings, all the variable state,

  • the CPU might.

  • And in cases where this is a good fit,

  • the central storage strategy is available.

  • And this is how you would create one.
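
The creation itself is a one-liner (TF 2.0-era experimental name):

```python
strategy = tf.distribute.experimental.CentralStorageStrategy()
```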

  • And that concludes the part talking about the Keras

  • and Estimator API support, as well as

  • the enumeration of the different types of strategies

  • that are available in the tf.distribute strategy API.

  • And in the last part of my talk, I'm

  • going to talk about how would you

  • go about distributing a model that you created using a custom

  • training loop, which is effectively a model

  • created out of lower level TensorFlow APIs?

  • The prerequisite for your custom training loop

  • to be distributable using the distribution strategy API

  • is that it has to adhere to the following programming model.

  • In particular, as far as data sources are concerned,

  • your variables may be read from any replica.

  • But the input data that's used for training

  • will be sharded, meaning divided into disjoint sets that

  • will be accessed exclusively by one replica.

  • Each replica performs computation on its sources.

  • And then the computation is combined using a reduction.

  • So, in essence, this programming model

  • is that of all-reduce synchronous training.

  • But it can be implemented using lower level TensorFlow APIs.

  • So let's take a look at how an example of a custom training

  • loop distributed through a distribution strategy

  • would look like.

  • So we create an instance of a distribution strategy.

  • And then we create a data set using our own create-dataset

  • method that takes a batch size.

  • The important bit here is that the batch size

  • should be the global batch size, that is a batch size that you

  • choose independently of the number of replicas or devices

  • on which you are going to run your computation on.

  • And it's going to be the responsibility

  • of the distribution strategy API to actually divide

  • this global batch size into per replica batch sizes.

  • And this is done through the experimental_distribute_dataset

  • invocation, which wraps the, quote unquote, sequential, data

  • set in what's called a distributed data set.

  • But as far as the custom training loop is concerned,

  • there is no difference between the two.

  • And your model, similar to the Keras API usage,

  • should be created under the strategy.scope, which

  • means that all the variables must be created

  • under this scope so that they're properly mirrored

  • across the different replicas.
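
A sketch of what this looks like; make_dataset and the model/optimizer choice are placeholders:

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

GLOBAL_BATCH_SIZE = 64   # chosen independently of the number of replicas

# make_dataset is a placeholder for your own dataset-construction function.
dataset = make_dataset(GLOBAL_BATCH_SIZE)
dist_dataset = strategy.experimental_distribute_dataset(dataset)

with strategy.scope():
    # All variables must be created under the scope so they are mirrored across replicas.
    model = tf.keras.applications.ResNet50(weights=None)
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)
```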

  • As an alternative to delegating the distribution

  • of your data set to TF distribution strategy,

  • you can also use an alternative API, distribute dataset

  • from function, which gives the user the control

  • to decide what portions of the data set

  • should be distributed on which replica

  • and how: instead of providing a data set,

  • you provide a data set factory, which takes as input

  • the distribution strategy input context,

  • which has information such as the particular replica

  • index or the total number of replicas.
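
A sketch of that alternative API, using the experimental name from this era (later renamed distribute_datasets_from_function); the sharding choice shown is just one reasonable option, and make_dataset is the same placeholder as before:

```python
def dataset_fn(input_context):
    # The input context exposes, e.g., the number of replicas and this pipeline's id.
    per_replica_batch = input_context.get_per_replica_batch_size(GLOBAL_BATCH_SIZE)
    dataset = make_dataset(per_replica_batch)
    # One reasonable choice: shard the data so each pipeline sees a disjoint slice.
    return dataset.shard(input_context.num_input_pipelines,
                         input_context.input_pipeline_id)

dist_dataset = strategy.experimental_distribute_datasets_from_function(dataset_fn)
```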

  • And then in the rest of my presentation,

  • we're going to take a look at how you would actually

  • build a custom training loop in a kind of bottom up fashion.

  • So the first thing, the lowest building block

  • is the logic that performs a single training

  • step on a replica.

  • And here's an example of how you would do that.

  • So you would use a GradientTape, perform some computation,

  • and then, with the help of the tape,

  • compute gradients, apply them to model variables,

  • and return the loss.

  • And this is something that happens on a single replica.
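
A sketch of such a per-replica step, reusing the model, optimizer, and GLOBAL_BATCH_SIZE names from the earlier sketches; the loss scaling by the global batch size follows the usual custom-training-loop recipe:

```python
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    reduction=tf.keras.losses.Reduction.NONE)   # reduce manually, as shown below

def replica_step(inputs):
    images, labels = inputs
    with tf.GradientTape() as tape:
        predictions = model(images, training=True)
        # Scale by the global batch size so that summing across replicas gives the right loss.
        loss = tf.reduce_sum(loss_object(labels, predictions)) / GLOBAL_BATCH_SIZE
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss
```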

  • Now, to tie this computation

  • that happens on different replicas

  • together into a single training epoch,

  • you can enumerate the individual elements of the data set

  • using Python iteration.

  • And then use the run API of the distribution strategy

  • with the replica step function and the input

  • to collect the loss for that particular replica.

  • And combine the individual losses

  • using the reduce call with a particular reduce operation.

  • And at that point, you could do any per-step processing

  • inside of this for loop.

  • For an example, you could print the loss.

  • But you could do other types of computations here as well.
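
A sketch of such an epoch function, reusing strategy, replica_step, and dist_dataset from the earlier sketches; note that at the time of this talk the run API was exposed as strategy.experimental_run_v2, while later releases call it strategy.run (used here):

```python
@tf.function
def train_epoch(dist_dataset):
    total_loss = 0.0
    for inputs in dist_dataset:   # Python iteration, converted to graph code by autograph
        per_replica_losses = strategy.run(replica_step, args=(inputs,))
        step_loss = strategy.reduce(
            tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None)
        # Any per-step processing (e.g. tf.print(step_loss)) could go here.
        total_loss += step_loss
    return total_loss
```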

  • Now, one thing you might notice is

  • that this train_epoch function has a tf.function decorator.

  • The effect of this decorator is that TensorFlow will interpret

  • this Python function as a graph computation,

  • optionally using autograph to convert Python idioms,

  • such as the Python iteration of our data set,

  • into equivalent graph building methods.

  • And the reason we recommend using tf.function decorator

  • for your train_epoch here is that it will generally result

  • in much better performance.

  • Because the entire training epoch

  • will be executing as a data flow graph as

  • opposed to a Python function.

  • Now, the last step of your custom training loop

  • is the iteration over multiple epochs, which

  • is pretty straightforward.

  • And this just illustrates how you

  • do that, and optionally inserting per epoch processing

  • inside of the outer loop, such as checkpointing your model

  • or running an eval of the model.
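
And a sketch of that outer loop; NUM_EPOCHS and the checkpoint object are placeholders:

```python
NUM_EPOCHS = 10   # placeholder

for epoch in range(NUM_EPOCHS):
    epoch_loss = train_epoch(dist_dataset)
    print("epoch", epoch, "loss", float(epoch_loss))
    # Per-epoch processing such as checkpointing or evaluation would go here, e.g.:
    # checkpoint.save(checkpoint_prefix)
```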

  • So before I end, I want to give you

  • an overview of what's supported in TF 2.0 beta

  • as far as distribution strategy is concerned.

  • And this is a screenshot from the TensorFlow website.

  • So you can either take a picture now,

  • or you can also go to the website.

  • In the first column, we see the three types

  • of model building APIs, namely Keras, Estimator,

  • and custom training loop.

  • And then on the top row, we have the different types

  • of strategies.

  • And, as you can see, the Estimator API

  • is well supported across different types of strategies

  • while the other combinations are supported or on the way.

  • Most of them are targeting the 2.0 release candidate (RC)

  • for availability.

  • And that brings me to the end of my talk.

  • Thank you very much for your attention.

  • So throughout the talk,

  • I've been sharing links to different tutorials.

  • All the tutorials can be found on the TensorFlow web site

  • under the resources link shown here.

  • And in case you have any questions

  • or you would like to request a feature or report issues,

  • our GitHub repository is the correct forum for that.

  • So thank you very much for your attention.

  • [APPLAUSE]

  • [MUSIC PLAYING]
