ROHAN JAIN: Hi, all. I'm Rohan, and I'm here to talk to you about how you can scale up your input data processing with tf.data.

So let's start with a high-level view of your ML training job. Typically, your ML training step has two phases. The first is data preprocessing, where you look at the input files and do all kinds of transformations on them to make them ready for the next phase, which is model computation. Data preprocessing happens on the CPU and involves things such as cropping images, or sampling frames from videos, and so on. So if your training is slow, the bottleneck could be in either one of these two places, and I hope the talk on profiling gives you an indication of how to figure out which of the two phases is slowing you down. I'm here to talk to you about the first kind of bottleneck: data preprocessing.

So let's look into what this bottleneck really is. In the last few years we've done a fantastic job building accelerators that do ML operations really fast, so the amount of time it takes to do a matrix operation and all the linear algebra operations is a lot smaller. But the hosts and the CPUs that feed data to these accelerators have not been able to keep up with them, and so they end up being a bottleneck. We thought we could mitigate this by making the models more complex, but the accelerators have constraints on how much RAM they have, and, more importantly, these models often get deployed to something like a mobile device, which restricts how much complexity you can introduce into your model. So that hasn't really panned out. The second approach people take is to use larger batch sizes. But larger batch sizes require a larger amount of preprocessing to assemble each batch, which puts further pressure on the host. So this is becoming an increasingly larger problem within Alphabet and even externally, and I'm going to talk to you about how you can solve it using tf.data.

tf.data is TensorFlow's data preprocessing framework. It's fast, it's flexible, and it's easy to use, and you can learn more about it in our guide.

As background for the rest of the talk, I'm going to go through a typical tf.data pipeline, and that will help us in the later stages. Suppose you have some training data in TFRecord files. You can start off with a TFRecordDataset over that data. After that, you do your preprocessing, which is typically the bulk of the logic; if it's images, you're doing cropping, maybe flipping, all sorts of things there. After that, you shuffle the data so that you don't train to the order in which the examples appear in the input, which helps with training accuracy. Then you batch it so that the accelerator can make use of vectorized computations. Finally, you add some software pipelining so that while the model is working on one batch of data, the preprocessing side can produce the next batch, and everything runs efficiently. You can then feed this tf.data dataset to a Keras model and start training. A minimal sketch of that pipeline is shown below.
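To make the pipeline concrete, here is a minimal sketch in Python. The file pattern, feature spec, image size, and batch size are illustrative placeholders rather than anything from the talk, and the preprocessing function is just one example of what that step might contain.

```python
import tensorflow as tf

# Hypothetical training files; the pattern is a placeholder.
files = tf.data.Dataset.list_files("/path/to/train-*.tfrecord")

def preprocess(serialized_example):
    # Parse one record and run the (potentially expensive) transformations,
    # e.g. decoding and resizing an image. The feature spec is illustrative.
    features = tf.io.parse_single_example(
        serialized_example,
        {"image": tf.io.FixedLenFeature([], tf.string),
         "label": tf.io.FixedLenFeature([], tf.int64)})
    image = tf.io.decode_jpeg(features["image"], channels=3)
    image = tf.image.resize(image, [224, 224]) / 255.0
    return image, features["label"]

dataset = (tf.data.TFRecordDataset(files)
           .map(preprocess, num_parallel_calls=tf.data.experimental.AUTOTUNE)
           .shuffle(buffer_size=10_000)  # avoid training on the input order
           .batch(32)                    # enable vectorized computation
           .prefetch(tf.data.experimental.AUTOTUNE))  # software pipelining

# model = tf.keras.Sequential([...])  # some Keras model
# model.fit(dataset, epochs=10)
```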
So given that basic pipeline, suppose you have a bottleneck. The first thing I'd recommend is to go through our single-host performance guide and try to utilize every trick and transformation that is available in tf.data to extract the maximum possible performance, so that you're using all the [INAUDIBLE] and whatever. There's excellent information in the guide that we have here, and [INAUDIBLE] did a great talk at the ML Tokyo Summit, which you can take a look at to learn more about this. So that's the first thing I'd recommend you do.

But suppose you have done that and you've tried all the different recommendations that we have here, and you're still bottlenecked on the data preprocessing part. Don't worry, you're not alone. This is very common; we've increasingly seen it with a lot of internal customers. So now I'm very pleased to present a couple of solutions that we've been working on on the team to help you solve that problem.

The first idea is: why don't we just reuse the computation? Suppose you're playing around with different model architectures. Your input preprocessing part remains largely the same, and if it's expensive and time-consuming, why don't we just do it once, save it, and then on every subsequent run read it back quickly? We noticed a bunch of internal customers, teams within Alphabet, who were trying to do this on their own outside of tf.data, and we decided to bring it into tf.data and make it incredibly fast, flexible, and easy to use. This is what we call Snapshot. The idea is what I explained to you: you materialize the output of your data preprocessing once, and then you can use it many, many times. This is incredibly useful for playing around with different model architectures and, once you settle on an architecture, for hyperparameter tuning. So you can get that speedup using Snapshot.

Next, I'm going to go through the pipeline that we talked about before and see how you can add Snapshot to it to make it faster. So that's the original pipeline that we had, and notice that there's this preprocessing step, which is expensive. With Snapshot, you just add a snapshot transformation right after that with a directory [INAUDIBLE]. With this, everything that is before the snapshot will be written to disk the first time the pipeline is run, and every subsequent time we will just read from it and go through the rest of the steps as usual. One thing I'd like to point out is that we place the snapshot at a particular location, before the shuffle, because if it's after the shuffle, everything gets frozen: all the randomization that you get out of shuffle, you lose, because every subsequent time you're just going to be reading the exact same order again and again. That's why we introduce it at that stage in the pipeline.

So Snapshot, we developed it internally. There are internal users and teams that are using it and deriving benefit out of it, and now we're bringing it to the open source world. We published an RFC, which has more information about it and some other technical details. This will be available in TensorFlow 2.3, but I believe it will be available in the [INAUDIBLE] shortly. A sketch of the pipeline with the snapshot added is shown below.
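As a rough illustration, here is the earlier sketch with a snapshot inserted between the expensive preprocessing and the shuffle. As of TensorFlow 2.3 the transformation lives under tf.data.experimental and is applied with Dataset.apply; the snapshot directory is a placeholder, and `preprocess` is the hypothetical function from the first sketch.

```python
import tensorflow as tf

files = tf.data.Dataset.list_files("/path/to/train-*.tfrecord")

dataset = (tf.data.TFRecordDataset(files)
           # Expensive but deterministic preprocessing (the same hypothetical
           # `preprocess` function as in the earlier sketch).
           .map(preprocess, num_parallel_calls=tf.data.experimental.AUTOTUNE)
           # Materialize everything above to disk on the first run and read it
           # back on subsequent runs. Placed before the shuffle so that the
           # shuffle's randomization is not frozen.
           .apply(tf.data.experimental.snapshot("/path/to/snapshot_dir"))
           .shuffle(buffer_size=10_000)
           .batch(32)
           .prefetch(tf.data.experimental.AUTOTUNE))
```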
So remember, I talked about two ideas. The second idea addresses the fact that not all computation is reusable. Suppose you had some randomized crops in there: if you wrote them to disk and read them back, you'd again lose that randomization, so Snapshot is probably not applicable in that scenario. The second idea, then, is to distribute the computation. The initial setup is that you have one host CPU driving a bunch of accelerators, but now you can offload the computation from that host to a cluster. You can utilize the computational power of all those workers to feed the host, so that you're not bottlenecked on input preprocessing anymore and things move fast.

This is the tf.data service. It's a tf.data feature that allows you to scale your workload horizontally: if you're seeing slowness in your input preprocessing, you can start adding workers and it'll just scale up. It's got a master-worker architecture, where the master drives the work for the different workers, and it gives you fault tolerance, so if one of the workers fails, you're still good and you can still make progress.

So let's see how you can use the tf.data service for the example that we have. Here, instead of having an expensive preprocessing step, let's say you have some randomized preprocessing. This is not snapshotable, because if you snapshot it, you lose the randomization. We'll provide you a binary which allows you to run the data service on the cluster manager that you like, whether it's Kubernetes or Cloud or something like that. Once you have that up and running, you can just add a distribute transformation to your tf.data pipeline and provide the master address. Anything before the distribute transformation will now run on the cluster that you have set up, and everything after will run on the host. This allows you to scale up. Again, note that because we are not doing any kind of freezing of the data, we can put this transformation as late as possible; notice that I've put it after the shuffle transformation.

The service, like Snapshot, has been developed with internal users. They've been using it, and it's been, like, a game-changer in terms of [INAUDIBLE] utilization. And now, again, we're bringing it to you. We published an RFC, which was well-received, and this should be available in 2.3 for you to play around with. A sketch of the pipeline using the tf.data service is shown below.
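As a rough sketch of what this looks like in code, assuming a tf.data service is already running somewhere on your cluster: as of TensorFlow 2.3 the client-side transformation is tf.data.experimental.service.distribute. The service address and the `randomized_preprocess` function are placeholders; the "parallel_epochs" processing mode shown here simply re-processes the full dataset on the workers each epoch.

```python
import tensorflow as tf

files = tf.data.Dataset.list_files("/path/to/train-*.tfrecord")

dataset = (tf.data.TFRecordDataset(files)
           # Hypothetical randomized preprocessing (e.g. random crops/flips),
           # which is why snapshotting would not be appropriate here.
           .map(randomized_preprocess,
                num_parallel_calls=tf.data.experimental.AUTOTUNE)
           .shuffle(buffer_size=10_000)
           # Everything above this point runs on the tf.data service workers;
           # everything below runs on the host driving the accelerators.
           # The address of the coordinating ("master") server started on
           # your cluster is a placeholder here.
           .apply(tf.data.experimental.service.distribute(
               processing_mode="parallel_epochs",
               service="grpc://dataservice-master:5050"))
           .batch(32)
           .prefetch(tf.data.experimental.AUTOTUNE))
```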
So to summarize, what did I talk about today? With various trends in hardware and software, we've ended up in a scenario where a lot of machine learning jobs are getting bottlenecked on input preprocessing. I've told you about two solutions that the tf.data team has been working on to help you solve this bottleneck. The first is Snapshot, which allows you to reuse your preprocessing, so that you don't have to do it multiple times. The second is the tf.data service, which allows you to distribute this computation to a cluster, so that you get the scale-up that you need. I hope you play around with these and give us feedback. And thank you for your time. [MUSIC PLAYING]