Subtitles section Play video Print subtitles JIRI SIMSA: Hi, everyone. My name is Jiri. I'm a software engineer on the TensorFlow team. And today, I'm going to be talking to you about tf.data and tf.distribute, which are TensorFlow's APIs for input pipeline and distribution strategy, respectively. To set the stage for what I'm going to be talking about, let's think about what are the basic building blocks for a machine learning workflow? Machine learning operates over data. It runs some computation. And it uses some sort of hardware to do this task. This hardware can either be a single CPU on your laptop. Or possibly it can be on your workstation that has either one or multiple accelerators, either GPUs or TPUs, attached to it. But you can also run the computation across a large number of machines that each have one or multiple accelerators attached to it. Now, let's talk about the how the machine learning building blocks are being served or reflected in the APIs that TensorFlow provides. So for the data handling part of the machine learning task, TensorFlow provides a tf.data API. It's the input pipeline API for TensorFlow. For the computation itself, such as supervised learning, TensorFlow offers a number of different both high level and low level APIs. You might be familiar with Keras or Estimators-- they've been mentioned in earlier talks today-- as well as lower level APIs for building custom training loops. And finally, to hide the hardware details of your computation, TensorFlow provides a tf.distribute API, which allows you to create your input pipeline and model in a way that's agnostic to the environment in which it's going to execute. So kind of thinking that your program is going to run, perhaps, on a single device, and with minimal changes being able to deploy it on a large set of different devices, a possibility of different machine learning architectures. In this talk, I'm going to talk about the tf.data input pipeline API. And then, in the second part, I'm also going to talk about the tf.distribute, the distribution strategy API. I'm not going to talk about Keras, and Estimator, and other APIs for the modeling itself, as that has been covered in previous talks. So without further ado, let's get started with tf.data, which is TensorFlow input pipeline API. So let's ask ourselves a question. Why do we need an input pipeline API in the first place? Why don't we just load the data in memory, maybe in our Python program as a non py array, and pass it into a Keras model? Well, there is actually a number of good reasons why we need an API or why using one will benefit us. First of all, the data might not fit into memory. For example, the ImageNet data set is 140 gigabytes of data, which do not necessarily fit into memory on every laptop or workstation. The data itself might also require randomized preprocessing, which means that we cannot preprocess everything ahead of time offline and then have the data to be ready for training. We actually need to have an input pipeline that performs the preprocessing, such as, in the case of ImageNet, perhaps image cropping or randomized image distortions or transformations on the fly as we're running dimensional learning computation. Having an input pipeline API as an abstraction might also allow us to, in the runtime of this API, implement things in a way that allows the computation to efficiently utilize the underlying hardware. And I'm actually going to spend a fair amount of the first part of my talk talking about how to efficiently utilize the hardware through the tf.data input pipeline abstraction. Last, but not least, which is something that ties the tf.data API to the tf.distribution API, using an input pipeline API abstraction allows us to decouple the task of loading and preprocessing of the data from the task of distributing the computation. We are using the abstraction, which allows you to create your input pipeline assuming it's going to run on one place. And then the distribution strategy will somehow distribute the data without you having to worry about the fact that the input pipeline might actually be evaluated in multiple places in parallel. So for those reasons, we created tf.data, TensorFlow's input pipeline API. And the way I like to think about tf.data is an input pipeline API's created through tf.data-- it's an ETL process. What I mean by that is the E, T, and L stand for different parts of the input pipeline stages. E stands for Extract. This is the stage in which we read the data, either from a memory or local or remote storage. And we possibly parse the file format that the data is stored in. Perhaps it's compressed. Then the T, the Transform stage, in this stage, we perform either domain specific or domain agnostic transformations. So the domain specific transformations are specific to the type of data we're dealing with. So, for instance, text vectorization, image transformation, or temporal video sampling are examples of domain specific transformations. While domain agnostic transformations include things like shuffling of your data during training or batching. That is combining multiple elements into a single higher dimensional element. And, finally, the last stage of the input pipeline, Loading, pertains to efficiently transferring the data onto the accelerator, which is either a GPU or TPU. What I should point out here is that, traditionally, the input pipeline portion of your machine learning computation happens on a CPU. Because some of the operations are naturally only possible on the CPU, which leaves the GPU and TPU resources available for your machine learning specific computations, such as your map models. This makes-- this puts an extra pressure on the efficiency with which the input pipeline performs. And the reason for that is-- which is what I'm trying to illustrate here with the graph-- is that over time the rate at which CPU performs has plateaued, while the computational power of GPUs and TPUs, thanks to recent hardware advances, continues to accelerate at an exponential rate, which opens up this performance gap between a raw CPU and GPU/TPU processing power available in a single machine. And that can-- the consequence of this could be that the CPU part of your machine learning computation, namely the input pipeline, can be a bottleneck of your computation. So it's really important that the CPU input pipeline performs as efficiently as it can. So let's take a look at an example of what a tf.data-based input pipeline actually looks like. Here, I'm using an example for a common image, or how a common image processing input pipeline would look like. We're first creating a data set using the TFRecordDataset operation. It's a data set constructor that takes a set of file names or a set of file patterns and produces elements that are stored in those files in a sequence-like manner. And once you create a data set, you can chain transformations onto the data set, thus creating new types of data sets. A very common one and very powerful one is the map transformation, which allows you to apply an arbitrary processing on the elements of the data set. And this preprocessing can be expressed as a function that ends up being traced using the mechanisms available in TensorFlow, meaning this function that is being used to transform elements of the data set is executed as a data flow graph, which has important implications for the performance and how the runtime can actually execute this function. And the last thing that I illustrate here is the batch transformation, which combines multiple elements of the input data set and produces a single element as an output that has a higher dimension, which is a common practice for training efficiency. Now one thing that's not illustrated here, but it actually does happen under the hoods inside of tf.data runtime is that for certain combinations of transformations, a. tf.data provides more efficient fused implementations. For instance, if a map transformation is followed by a batch transformation, we actually have a highly efficient C++ based implementation for the combination of the two that can give you up to 2x speed up in the performance of your input pipeline. And that happens kind of magically behind the scenes. And the important bit that I want to highlight here is that the user doesn't need to worry about it. The user here doesn't really need to do anything with respect to optimizing the performance. They focus on creating an input pipeline with the functional preprocessing in mind. And once you create the data set that you would like, you can pass it into TensorFlow high level API such as Keras or Estimator, which all support data set abstraction as an input for the data. So let's talk a bit more about the input pipeline performance. If you were to implement the input pipeline in naive fashion using CPU for the input pipeline processing or data preparation and the GPU and TPU for the training computation, you might end up in a situation like is illustrated on the slide where at any given point in time you're only utilizing one of two resources available to you. And you could probably tell that this seems rather inefficient. Well, a common technique that can be used to make this style of computation more efficient is called software pipeline. And the idea is that while you're working on the current element for training step on a GPU and a TPU, you're already started preprocessing data for the next training step on a CPU. And thus, you overlap the computation that happens on the two devices or two resources available to you. To achieve that, the effect of software pipelining in tf.data is pretty straight forward. All you do is you chain a .prefetch transformation to a particular point in your input pipeline. And the effect of doing that will be that the producer of the data up to that point will be decoupled from the consumer of the data, in this case, the Keras model. And the two will be operating independently, coordinating through an internal buffer. And this will have the desired effect of software pipelining that I illustrated in the previous slide. Another opportunity for improving the performance of your input pipeline is to parallelize the transformation. So the top part of this diagram illustrates that we're using sequential processing for applying the map transformation of the individual elements of the batch that we are then going to create. But there is no reason that you need to do that unless there would, in effect, be some sort of data or control dependency. But commonly, there is not. An in that case, you can parallelize and overlap the preprocessing of all the individual elements for which we're going to create the batch out of. So let's take a look at how we would do that using the tf.data API. And similar to the software pipelining idea, this is pretty straightforward. You simply add a single argument, num_parallel_calls, to the map transformation, which indicates to the tf.data runtime that it should, in fact, preprocess elements of the input data set in parallel. Important bit here is that the user doesn't really need to worry about the threading or multiprocessing and use complicated Python APIs or be aware of things like the global interpreter log. It just happens inside of the tf.data runtime, which is implemented in C++. And thus, it sidesteps the complexities that the user would need to go through. And a last best practice for optimizing the performance of an input pipeline is that of parallel extraction. So similar to the parallel transformation, where in that case the sequential mapping of the data might have been the bottleneck, another potential source of a bottleneck of your input pipeline is the sequential nature with which date is being read. If you're just reading elements from a file one file at a time, the I/O could actually be a bottleneck of your input pipeline. And the answer to that, well, you don't have to do that sequentially. You can do it in parallel. And to do that using the tf.data API, well, this time it's not a one line change. It's a two line change. And so what changes is that we're going to replace the TFRecordDataset source with two lines. The first line uses a list_files transformation, which creates a data set that is going to contain all the file names to which the particular pattern that we specify evaluates to. And then we're going to apply the interleave transformation to this data set, which takes a user defined function, which is a data set factoring operating on the inputs-- in this case, file names-- and producing data sets-- in this case, TF record data sets for that particular file name. And specifying the num_parallel_calls protocols argument will determine how many files should we be reading in parallel at any given point in time? Now, I kind of cheated up to this point in my presentation. Because I said, well, the user doesn't really have to worry about performance and the aspects of their environment. And it turns out that in order to choose optimal values for these num_parallel_calls arguments or the buffer size for prefetch, you actually have to understand your environment. At least that's how it used to be, historically, when this API was first introduced. And over the past year or so, we actually worked on lifting this restriction and making the performance of tf.data great out of the box. And the way this is achieved by is instead of specifying manually what the right buffer size or the right number of parallel calls should be for these different transformations, you can actually specify this special constant called tf.data.experimental.AUTOTUNE. And if you do that, this will indicate to the tf.data runtime that you want to delegate the task of choosing the optimal level of parallelism or buffer size to the tf.data runtime. And it will do that on your behalf. I should mention that auto tuning, at this point, is enabled by default. But you still have to specify the constant if you actually want to indicate which of these knobs should be autotuned. You can also disable autotuning if you would like to try to do this manually. And the mechanism for disabling autotuning is tf.data.Options. The tf.data.Options is an object that specifies global options that should be used for your input pipeline. And besides controlling autotuning, it can also be used to control things like static optimizations that are not enabled by default, because they are not always a win, such as map vectorization or map parallelization, or, for instance, specifying whether your input pipeline is allowed to produce elements out of order, which, by default, your input pipeline will be deterministic. The options object also allows you to, for example, collect statistics about data in your input pipeline. And for the performance experts in the audience, it also allows you to fine tune threading parameters of tf.data internals. And the way you would use tf.data.Options is that once you create your data set, you also create an instance of the options object and set whatever options that you're interested in. In this example, I'm setting the a map_parallelization optimization on. And then, importantly, you associate the options object with the data set using the with.options transformation, which, similar to all the other transformations that I talked about up to this point, returns back a new data set that now has the options applied. Last thing pertaining to tf.data that I would like to talk about is the TensorFlow data sets project. So up to this point, I've been talking about just core tf.data API, which can be used by our users to create input pipeline using-- starting from raw data. However, for a lot of common existing data sets, this is a repetitive task. And especially machine learning learners or novice users do not necessarily want to do that to get started with machine learning. And to address or make it easier to onboard new users, as well as make it easier to use existing data sets, the TensorFlow data set projects provides canned data sets that are ready to be used with the rest of TensorFlow. The way you could use TensorFlow data sets project is once you import it as a module, you can, for example, list, through the call list-builders, the set of available data sets. And I think, at this point, there is something like 60 plus different data sets spanning text, image, audio, and video, that are supported through the TensorFlow data sets project. Then, through the load command, using the name as the identifier of the data set you would like to load and optionally the split argument, which you can use to identify whether you want the training or the test portion of the data set, you get back an instance of a tf.data data set that can be immediately used with your model. Or you could, because it's a tf.data data set instance, you can optionally apply some custom transformations to it, such as, in this case, shuffling and batching. Or, if you would like to just inspect what's inside of the data set, you could do so using a simple Python-like iteration where you can print the elements of the data set. So this concludes the first part of my talk in which I talked about tf.data. And in the second part of my talk, we're going to talk about the distribution strategy API. So similar to the first part our talk, where we asked ourselves, why do we need an input pipeline API? Let's start by asking ourselves, why do we need to do distributed training? Why do we need distribution strategy API? Well, it turns out that if we do training in one machine, on one device, it can take a pretty long time. This graph illustrates that by showing the accuracy achieved by the ResNet model over time on the ImageNet data set using a single GPU. And you can see that it takes close to 90 hours to get to accuracy around 75%, while the most performant implementations of the same model, or deployments of the same model actually take less than 10 minutes using an amazing amount of resources parallelizing this computation. What going down from 87 hours to 10 minutes enables is that you can actually experiment with ideas very quickly as opposed to starting an experiment and waiting for one or two days before you can do the next iteration. And I think this is game changing. So I hope I convince you that distributing your computation, if it takes a very long time, is a very good idea. So let's talk about how you do that with TensorFlow's distribution strategy API. There is three main goals that the distribution strategy API has. First of all, it should be easy to use. What this means is that it should be possible for you to create your input pipeline and your model assuming that it's going to run on one device and then, with minimal code changes, be able to deploy to different architectures, either multiple GPUs on your workstation or possibly even a cluster of workstations that either have GPUs or TPUs attached to it. It should also provide great out of box performance. This means that the performance that you get out of using distribution strategy should be close to the performance you would get if you were manually targeting a specific architecture with your implementation. And finally, it should be versatile. So it should support different types of architectures, different types of hardware, and different types of APIs for your input pipeline or model. The use cases for the distribution strategy API can be roughly categorized as follows, ranging from the simplest to perhaps the most advanced. So the simplest one is you have a model that uses either the Keras or Estimator API. And you would like to distribute it. And this is what we are going to cover in this talk. The second one is you have a model that you used lower level TensorFlow APIs to create a custom training loop. And you would like to distribute it. And we're also are going to cover that in this talk. Now, the last two, the more advanced ones, namely making a layer, library, or infrastructure distribution-aware-- so, for example, how would you make something like Keras distribution aware? [INAUDIBLE] how would you make a new strategy, where strategy is something that is an abstraction that hides or decouples the model and input pipeline from the particular architecture? Those two use cases we will not cover in this talk. But they're covered by guides and tutorials on the TensorFlow web site. So in case that you would like to learn more, I direct your attention to the TensorFlow website. So let's start by talking about the use case where you have a model that's created either and Keras and Estimator. And you would like to distribute it using the distribution strategy. And in this section, I'm also going to introduce the distribution strategies that are actually available in TensorFlow. So the first the strategy that's available that's called mirrored strategy is one that allows you to distribute your program across multiple GPUs attached to a single worker. And the particular implementation of this strategy using something called all-reduce synchronous training, where the synchronous part means that all of the devices will be performing steps in a lock-step, so in a coordinated fashion. While the all-reduce portion pertains to how the different devices exchange information about the local updates that they collect in each step. To shed a little more light onto how the all-reduce algorithm works, on this slide, I illustrate what happens in the all-reduce algorithm when you have three GPUs that each perform a single step that updates a mirrored version of three variables. So each of the boxes, the blue, the green, and the pink, corresponds to a variable that, in a single step, receive different updates on different devices. And once the step is performed on all the devices, we can propagate the updates in a circular fashion between the different devices. And at that point, all of the devices will have all of the updates from all the devices for all of the variables, requiring N minus one transfers for N devices. And then, once all the updates have been collected, a reduce function can be used to combine the updates to a single global value where the common reduced functions are either a sum or an average of those updates. And with that knowledge in mind, a single step of a synchronous training can be illustrated on this example, where let's assume we have a model with two layers. And each layer has two variables. And the variables are mirrored on each device. We have two devices. Now, in the forward pass, data is propagated through the layers. And then in a backward pass, the gradients for the variables are computed. And at that point, the updates to the two variables on the different devices might be different. Because we actually use two different pieces of data on each device. And at that point, it's where we use the all-reduce algorithm to share the updates on each device with each other, and thus achieving a global state across the two devices. And this is what synchronous training refers to. Now, to-- let's take a look at how you would actually go about using a mirrored strategy with the Keras and the Estimator APIs. So to create an instance of a mirrored strategy, you can-- there's a couple of different factories, a default one or one where you can explicitly name the devices that you would like to create the mirrored strategy for. I believe that the default is if you don't specify it, it's going to be all the GPUs attached to your worker. And you can also optionally specify arguments for the all-reduce algorithm through the cross device of argument of MirroredStrategy constructor. Now, how would we use mirrored strategy or any other strategy, for that matter, with Keras API? Well, here's a common or a simple example of a Keras model for ResNet 50 with a stochastic gradient descent optimizer. We create the model. We specify the optimizer. And then we use the compile and fit APIs to perform training over our training data set, which is an instance of tf.data data set. Now, this runs on a single machine, possibly using a local GPU. In case we have multiple GPUs, we can simply define an instance of MirroredStrategy and then make sure that all of the model creation is wrapped inside of a strategy.scope. And with these two lines, your program will now be able to run on all the GPUs available on the worker. And the key here is that the strategy.scope will take care of variable creation inside of your model, making sure that all the variables are mirrored on the different GPU devices. And the body of the strategy.scope will be distribution aware. So recall that one of the goals for the distribution strategy API was that it provides great out of box performance. So on this slide, I would like to convince you that it does, at least for mirrored strategy on the ResNet 50. So what we're looking at here is the performance of a ResNet 50 based training using Keras, running TensorFlow 2.0 on Google Cloud. The vertical axis of the graph plots images per second. And the horizontal axis ranges the number of GPUs from one to two to eight. And we can see that using mirrored strategy achieves close to linear scaling, starting with a single GPU achieving roughly 1,250 images per second to eight GPUs achieving close to 10,000 images per second. Now, we've covered the Estimator-- sorry, the Keras API usage with distribution strategy. Let's also cover the Estimator API usage. So this is a common example of how you would use the Estimator API for your training. Namely, you define a classifier using the Estimator constructor that you provision with a model function. And then, through the train call, you specify an input function, which can return, for instance, a tf.data data set. And it performs the training. In order to parameterize the Estimator API with a strategy, all you need to do is, again, to create an instance of the strategy, in this case, MirroredStrategy, and pass it in through the RunConfig option into the Estimator API. And once that happens, the RunConfig will actually-- with a strategy, will make sure that the model function is created once per replica. And replica, in this case, refers to the GPU. So you're going to have copies of the model on each GPU as well as of all the variables inside of the model. And you will perform the all-reduce synchronous training across the multiple GPUs. So distributing your computation across multiple GPUs on a single machine can get you up to N, where N is the number of accelerators attached to your machine, speed up. But there is a physical limit to how many accelerators you can have. And to go beyond that limit, the natural next thing is to actually use multiple machines with each one [? are ?] multiple accelerators. And that's what the multi-worker mirrored strategy is intended to help you with. And it's very similar to the mirrored strategy. The only difference is that instead of distributing your computation over GPUs on a single machine, it distributes the computation over GPUs on many machines. And it performs the all-reduce computation not just across GPUs on a single workstation, but across the GPUs on all the different workstations. And it does so through TensorFlow collective ops, which allows you to actually send data in a broadcast fashion between the different TensorFlow workers. The way you would use this API is similar to the mirrored strategy API. So you can create a default instance of a MultiWorkerMirroredStrategy. Or you can specify a specific CollectiveCommunication algorithm to be used. Unlike the mirror strategy, you also need to specify information about the different workers that are participating inside of your computation. And this is done so through a JSON encoded string that identifies the host and ports of your different workers as well as task types. The third strategy that I'm going to talk about, and that's available in the TensorFlow distribution strategy API is the TPU strategy. And this one is very similar to the mirrored strategy. The main difference is that it allows you to perform the all-reduce synchronous training on TPUs, which are the hardware accelerators made by Google specifically for TensorFlow. But at this point, there are also other frameworks that are capable of leveraging them. And you can do so through the Google Cloud platform. And unlike the mirrored strategy, it uses the cross_replica_sum to perform the all-reduce on TPUs, which is something that's a difference between GPUs and TPUs. And you can use this strategy for training on a single TPU or an entire pod, which that's a term that refers to a set of TPU cores in a topology. To use a TPU strategy is a little more complicated. And it's also somewhat of an area of active development, which the experimental portions of the API refer to. But the high level idea is that you create a TPU cluster resolver, which allows you to gather information about your TPU hardware. And then you create the TPU strategy with this cluster resolver argument, which then allows the TPU strategy to be aware of the TPU hardware location and specifics. And so up to this point, I've been talking about synchronous training, where all the devices in your training loop are performing or operating in a lock-step, one step at a time. An alternative to synchronous training, which might be suitable for certain types of machine learning tasks, is so-called asynchronous training, where the different devices or different workers in your set of workers performing your computation are actually running at different rates. And one of the architectures that enables asynchronous training is a so-called parameter server and worker architecture where your machines have one of two roles, parameter server tasks or worker tasks. The parameter server tasks is where global variable state is stored and either updated or fetched from by the individual workers. While the workers perform a dimension learning computation one step at a time, but not necessarily at the same rate. And this architecture can be targeted for your machine learning program using the parameter server strategy. You create it using this factoring. And similar to the multi-worker strategy, you need to specify information about the workers and the types of tasks that the worker machines should play, namely the worker task or the parameter server task. And again, this is done so through the TF_CONFIG environment variable. And a last strategy that I want to talk about and that's available through the distribution strategy API is the central storage strategy. This is a special case of the parameter server strategy where there is a single parameter server. And its role is being fulfilled by a CPU of the machine on which the other devices reside. And the benefit of this strategy is that any single GPU might not be able to fit all the embeddings, all the variable states inside of them. But the CPU might. And in cases where this is a good fit, the central storage strategy is available. And this is how you would create one. And that concludes the part talking about the Keras and Estimator API support, as well as the enumeration of the different types of strategies that are available in the tf.distribution strategy API. And in the last part of my talk, I'm going to talk about how would you go about distributing a model that you created using a custom training loop, which is effectively a model created out of lower level TensorFlow APIs? The prerequisite for your custom training loop to be distributable using the distribution strategy API is that it has to adhere to the following programming model. In particular, as far as data sources are concerned, your variables may be read from any replica. But the input data that's used for training will be sharded, meaning divided into disjoint sets that will be accessed exclusively by one replica. Each replica performs computation on its sources. And then the computation is combined using a reduction. So, in essence, this programming model is that of all-reduce synchronous training. But it can be implemented using lower level TensorFlow APIs. So let's take a look at how an example of a custom training loop distributed through a distribution strategy would look like. So we create an instance of a distribution strategy. And then we create a data set using your own create data set method that takes a batch size. The important bit here is that the batch size should be the global batch size, that is a batch size that you choose independently of the number of replicas or devices on which you are going to run your computation on. And it's going to be the responsibility of the distribution strategy API to actually divide this global batch size into per replica batch sizes. And this is done through the experimental_distribute_dataset invocation, which wraps the, quote unquote, sequential, data set in what's called a distributed data set. But as far as the custom training loop is concerned, there is no difference between the two. And your model, similar to the Keras API usage, should be created under the strategy.scope, which means that all the variables must be created under this scope so that they're properly mirrored across the different replicas. As an alternative to delegating the distribution of your data set to TF distribution strategy, you can also use an alternative API, distribute dataset from function, which gives the user the control to decide what portions of the data set should be distributed on which replica and how by, instead of providing a data set, you provide a data set factoring, which can input the distribution strategy context, which has information such as the particular replica index or the total number of replicas. And then in the rest of my presentation, we're going to take a look at how you would actually build a custom training loop in a kind of bottom up fashion. So the first thing, the lowest building block is the logic that performs a single training step on a replica. And here's an example of how you would do that. So you would use a GradientTape, perform some computation, and then with the [INAUDIBLE] of with the tape, compute gradients, apply them to model variables, and return the loss. And this is something that happens on a single replica. Now, to tie this computation across-- that happens on different replicas together in a single training epoch, you can enumerate the individual elements of the data set using Python iteration. And then use the run API of the distribution strategy with the replica step function and the input to collect the loss for that particular replica. And combine the individual losses using the reduce call with a particular reduce operation. And at that point, you could do any per-step processing inside of this for loop. For an example, you could print the loss. But you could do other types of computations here as well. Now, one thing you might notice is that this train_epoch function has a tf.function decorator. The effect of this decorator is that TensorFlow will interpret this Python function as a graph computation, optionally using autograph to convert Python idioms, such as the Python iteration of our data set, into equivalent graph building methods. And the reason we recommend using tf.function decorator for your train_epoch here is that it will generally result in much better performance. Because the entire training epoch will be executing as a data flow graph as opposed to a Python function. Now, the last step of your custom training loop is the iteration over multiple epochs, which is pretty straightforward. And this just illustrates how you do that, and optionally inserting per epoch processing inside of the outer loop, such as checkpointing your model or running an eval of the model. So before I end, I want to give you an overview of what's supported in TF 2.0 beta as far as distribution strategy is concerned. And this is a screenshot from the TensorFlow website. So you can either take a picture now, or you can also go to the website. In the first column, we see the three types of model building APIs, namely Keras, Estimator, and custom training loop. And then on the top row, we have the different types of strategies. And, as you can see, the Estimator API is well supported across different types of strategies while the other combinations are supported or on the way. Most of them are targeting the RC release candidate of 2.0 for availability. And that brings me to the end of my talk. Thank you very much for your attention. In case-- so throughout the talk, I've been sharing links to different tutorials. All the tutorials can be found on the TensorFlow web site under the resources link shown here. And in case you have any questions or you would like to request a feature or report issues, our GitHub repository is the correct forum for that. So thank you very much for your attention. [APPLAUSE] [MUSIC PLAYING]
B1 data strategy api pipeline tf data tf Inside TensorFlow: tf.data + tf.distribute 9 0 林宜悉 posted on 2020/04/04 More Share Save Report Video vocabulary