FRANK CHEN: So hi everyone. I'm Frank, and I work on the Google Brain team working on TensorFlow. Today, for the first part of this talk, I'm going to talk to you about accelerating machine learning with Google Cloud TPUs.

So the motivating question here is, why is Google building accelerators? I'm always hesitant to predict this, but if you look at the data, the end of Moore's law has been going on for the past 10 or 15 years. We don't really see the 52% year-on-year growth in single-threaded performance that we saw from the late 1980s through the early 2000s anymore. Single-threaded CPU performance is now growing at a rate of maybe 3% to 5% per year. What this means is that I can't just wait 18 months for my machine learning models to train twice as fast. That doesn't work anymore.

At the same time, organizations are dealing with more data than ever before. People are uploading hundreds and hundreds of hours of video every minute to YouTube. People are leaving product reviews on Amazon. People are using chat systems such as WhatsApp. People are talking to personal assistants, and so on and so forth. More data is generated than ever before, and organizations are just not really equipped to make sense of it and use it properly.

And the third thread is that, at the same time, we have an exponential increase in the amount of compute needed by these machine learning models. There's a very interesting blog post by OpenAI on this. In late 2012, when deep learning was first becoming useful, we had AlexNet and Dropout, which used a fair amount of computing power, but not that much compared to late 2017, when DeepMind published the AlphaGo Zero and AlphaZero papers. Over about six or seven years, we see the compute demand increase by 300,000 times. This puts a huge strain on companies' compute infrastructure.

So what does this all mean? The end of Moore's law, plus this exponential increase in compute requirements, means that we need a new approach for doing machine learning. And at the same time, of course, everyone still wants to do machine learning training faster and cheaper. That's why Google is building specialized hardware.

Now, the second question you might be asking is, what sort of accelerators is Google building? From the title of my talk, you know that Google is building a type of accelerator that we call Tensor Processing Units, which are specialized ASICs designed for machine learning. This is the first generation of our TPUs, which we introduced back in 2015 at Google I/O. The second generation, now called Cloud TPU version 2, we introduced at Google I/O last year. These Cloud TPU version 2 boards can be combined into pods called Cloud TPU v2 Pods. And at Google I/O this year, we introduced the third generation of Cloud TPUs. It goes from air cooled to liquid cooled, and of course, you can link a bunch of them up into a pod configuration as well.

So what are the differences between these generations of TPUs? The first version of TPUs was designed for inference only, and it did about 92 teraops of int8. The second generation does both training and inference. It operates on floating point numbers, it does about 180 teraflops, and it has about 64 gigabytes of HBM. And the third generation of TPUs is a big leap in performance.
So now we are doing 420 teraflops, and we doubled the amount of memory, so now it's 128 gigabytes of HBM. And again, it does training and inference. And of course, we see the same sort of progress with Cloud TPU Pods as well. Our 2017 pods did about 11.5 petaflops, that is, 11,500 teraflops of compute, with 4 terabytes of HBM. And our new generation of pods does over 100 petaflops with 32 terabytes of HBM. The new generation of pods is also liquid cooled, with a new chip architecture.

So that's all well and good, but really, what we are looking for here is not just peak performance, but cost-effective performance. Take this very commonly used image recognition model called ResNet-50. If you train it on, again, a very common dataset called ImageNet, we achieve about 4,100 images per second on real data. We also achieve that while getting state-of-the-art final accuracy, in this case 93% top-5 accuracy on the ImageNet dataset. And we can train this ResNet model in about 7 hours and 47 minutes. This is actually a huge improvement. If you look at the original paper by Kaiming He and others, where they introduced the ResNet architecture, it took them weeks and weeks to train one of these models. Now, with one TPU, we can train it in 7 hours and 47 minutes. And of course, these things are available on Google Cloud. For this training run, if you pay for the resource on demand, it's about $36, and if you pay for it using Google Cloud's preemptible instances, it's about $11. So it's getting pretty cheap to train.

And of course, we want that cost-effective performance at scale. If you train the same model, ResNet-50, on a Cloud TPU version 2 Pod, you get something like 219,000 images per second of training performance. You get the same final accuracy, and training time goes from about eight hours to about eight minutes. Again, that's a huge improvement. This gets us into the region where you can just go train a model, go get a cup of coffee, come back, and see the results. It gets into almost interactive levels of machine learning research and development.

So that's great. The next question would be, how do these accelerators work? Today we are going to zoom in on the second generation of Cloud TPUs. Again, this is what it looks like; this is one entire Cloud TPU board that you see here. The first thing that you want to know is that Cloud TPUs are really network-attached devices. If I want to use a Cloud TPU on Google Cloud, I go to the Google Cloud Console and I create a Cloud TPU. Then I create a Google Compute Engine VM, and on that VM, I just have to install TensorFlow. So literally, I do pip install tensorflow, and then I can start writing code. There are no drivers to install. You can use a clean Ubuntu image, or you can use the machine learning images that we provide. So it's really very simple to get started with.

Each Cloud TPU is connected to a host server with 32 lanes of PCI Express. The thing to note here is that the TPU itself is an accelerator, so you can think of it like a GPU: you can't run Linux on it by itself. It's connected to the host server by 32 lanes of PCI Express to make sure that we can transfer training data in and get our results back out quickly.
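Just to make that concrete, here's a minimal sketch of what talking to a network-attached TPU from the VM looks like, assuming TensorFlow 1.x and a Cloud TPU named "my-tpu" in the same project and zone; the name is a placeholder, not something from the talk.

```python
# A minimal sketch, assuming TensorFlow 1.x on the Compute Engine VM and a
# Cloud TPU named "my-tpu" (name is hypothetical). The TPU is network-attached,
# so the host just resolves its gRPC endpoint and opens a session against it.
import tensorflow as tf

resolver = tf.contrib.cluster_resolver.TPUClusterResolver(tpu='my-tpu')
tpu_address = resolver.get_master()   # something like 'grpc://10.240.1.2:8470'

with tf.Session(tpu_address) as sess:
    sess.run(tf.contrib.tpu.initialize_system())   # bring up the TPU system
    print(sess.run(tf.constant('Hello from the TPU host')))  # runs on the remote TPU host
    sess.run(tf.contrib.tpu.shutdown_system())     # release it when done
```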
And of course, you can see on this board that there are four fairly large heat sinks. Underneath each heat sink is a Cloud TPU chip. Zooming in on the chip, here's a very simplified diagram of the chip layout. As you can see, each chip has two cores, each connected to 16 gigabytes of HBM, and there are very fast interconnects that connect these chips to the other chips on the board and across the entire pod. Each core does about 22.5 teraflops, and each core consists of a scalar unit, a vector unit, and a matrix unit. We are operating mostly on float32s, with one exception.

Zooming in on the matrix unit, this is where all the dense matrix math and dense convolution happens. The matrix unit is implemented as a 128 by 128 systolic array that does bfloat16 multiplies and float32 accumulates. There are two terms here that you might not be familiar with, bfloat16 and systolic arrays, so I'm going to go through each of these in turn.

Here's a brief guide to floating point formats. If you are doing machine learning training and inference today, you're probably using fp32, or what's called single-precision IEEE floating point. In this case, you have one sign bit, eight exponent bits, and 23 significand bits, and that allows you to represent a range of numbers from about 10 to the negative 38 to about 10 to the 38. So it's a fairly wide range of numbers that you can represent.

In recent years, people have been trying to train neural networks in fp16, or half-precision IEEE floating point. People at TensorFlow and across the industry have been trying to make this work well and seamlessly, but the truth of the matter is that you have to make modifications to many models for them to train properly if you're only using fp16, mainly because of issues like managing gradients, where you have to do loss scaling and all sorts of things. The reason is that the range of representable numbers for fp16 is much narrower than for fp32: the range here is just from about 6 times 10 to the negative 8 to about 65,000.

So what did the folks at Google Brain do? We came up with a floating point format called bfloat16. bfloat16 is like float32, except we drop the last 16 bits of mantissa. This gives you the same sign bit and the same exponent bits, but only 7 bits of mantissa instead of 23. In this way we can represent the same range of numbers, just at a much lower precision. And it turns out that you don't need all that much precision for neural network training, but you do actually need all the range.
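To make the format concrete, here's a small illustrative numpy sketch of that idea (my own illustration, not how the hardware actually does it; real hardware rounds rather than simply truncating): a bfloat16 value is just the top 16 bits of a float32, so the sign and exponent survive and most of the mantissa is dropped.

```python
# Illustrative sketch only: emulate bfloat16 by keeping the top 16 bits of a
# float32 (1 sign + 8 exponent + 7 mantissa bits). Real hardware rounds; this
# simply truncates to show the format.
import numpy as np

def to_bfloat16_bits(x):
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits >> 16).astype(np.uint16)

def from_bfloat16_bits(b):
    return (b.astype(np.uint32) << 16).view(np.float32)

x = np.float32([3.14159274, 1.0e38, 1.0e-20])
print(x)                                        # original float32 values
print(from_bfloat16_bits(to_bfloat16_bits(x)))  # same range, only ~2-3 decimal digits of precision
```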
And then the second term is systolic arrays. Rather than trying to describe what a systolic array is, I will just show you a little animation I made. In this case, we are doing a very simple matrix times vector computation: y equals w times x, where w is a 3-by-3 matrix and x is a three-element vector, and we are computing with a batch size of three. We have already loaded all the weights into the matrix unit. If we start the first clock cycle, you'll see that the first element of the first vector is loaded into the matrix unit, and then we multiply position 1, 1 of w with the first element of the first vector. In the second clock cycle, more input elements are loaded, so we are doing more multiplications.

At the same time, we are pushing the results from the previous round of multiplications onward through the array. So in the case of the yellow box right there, we are not just doing a multiplication; we are also summing the result of the multiplication that happens within the box with the result from the box to its left. And this continues. As you can see, you are utilizing a lot more compute now, until you get the outputs out.

So what this effectively is, is a 2D field of compute. It allows us to put a lot of compute units within a very small amount of chip area, which optimizes the cost of the chip, because the bigger the chip, the higher the cost. And the chip architecture is also built for pipelining: in this previous example we only had a batch size of three, but with bigger batch sizes you can always keep the matrix units filled. This means that we get very high throughput for our matrix multiplications, which are really at the heart of a lot of these deep learning models.
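Here's a toy cycle-by-cycle simulation of that animation, as a sketch of the general weight-stationary systolic idea rather than of the actual TPU microarchitecture: weights stay put in the cells, input elements flow down the columns, partial sums flow rightward along the rows, and once the pipeline fills, finished results stream out of the right edge every cycle.

```python
# Toy simulation of a weight-stationary systolic array (a sketch, not the real
# TPU hardware). Cell (i, j) holds weight W[i, j]; inputs flow down the columns,
# partial sums flow right along the rows; one multiply-accumulate per cell per cycle.
import numpy as np

n = 3
W = np.arange(1, 10, dtype=np.float32).reshape(n, n)   # 3x3 weights, preloaded into the cells
X = np.arange(1, 10, dtype=np.float32).reshape(n, n)   # batch of 3 input vectors, one per column

a = np.zeros((n, n), dtype=np.float32)   # input value currently held in each cell (moves down)
p = np.zeros((n, n), dtype=np.float32)   # partial sum produced by each cell (moves right)
Y = np.zeros((n, n), dtype=np.float32)   # collected outputs; Y[:, b] should equal W @ X[:, b]

for t in range(3 * n - 2):               # 7 cycles drain a 3x3 array with a batch of 3
    # Skewed feed: element j of vector b enters the top of column j at cycle b + j.
    top = np.zeros(n, dtype=np.float32)
    for j in range(n):
        b = t - j
        if 0 <= b < n:
            top[j] = X[j, b]
    # Inputs shift down one row; the new top row is fed in.
    a = np.vstack([top, a[:-1]])
    # Each cell adds W[i, j] * (its current input) to the partial sum arriving from its left.
    p = np.hstack([np.zeros((n, 1), dtype=np.float32), p[:, :-1]]) + W * a
    # The rightmost column emits results: y_i of vector b appears at cycle b + i + n - 1.
    for i in range(n):
        b = t - i - (n - 1)
        if 0 <= b < n:
            Y[i, b] = p[i, n - 1]

print(np.allclose(Y, W @ X))   # True: the systolic schedule reproduces the matrix product
```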
So OK, cool. How do I use these accelerators? Our recommendation is that you start with our Cloud TPU reference models. These are high performance, open source models; they are licensed under, I think, the Apache license. They implement very common and also cutting-edge model architectures, and internally, we test them for performance and accuracy. You can use these to get up and running really quickly, and you can modify them as needed. So you can train and run them on the sample data, on your own data, and so on and so forth. And we have a lot of reference models. I gave you the example of ResNet-50 and other image recognition networks, but you can also do things like machine translation, language modeling, speech recognition, and image generation. We have all of these as sample models for Cloud TPUs if you want to get started with them.

Great. So remember those pods? It turns out that for a lot of our models, we have not only optimized them for single TPUs, we've also optimized them for TPU pods. For instance, take the ResNet-50 example that I quoted performance figures for earlier. Here you've got training on a single Cloud TPU, and this is literally all you do: you start a TPU, you install TensorFlow, you clone the Git repository, and then you basically call Python and point it to the TPU, point it to where your data is, tell it what the batch size is, and tell it how many steps you want to train for. And then, bam, off you go. It turns out that training on a Cloud TPU Pod is not that different. Instead of starting a Cloud TPU, you start a Cloud TPU Pod, and really, the only things you have to modify are the name of the TPU, the training batch size, and the number of training steps. The reference model for ResNet-50 uses fairly recent techniques, such as the LARS optimizer and label smoothing, to achieve the target accuracy, so you don't have to re-implement all these changes. We have already done it for you. So a lot of our reference models scale up from one TPU all the way to a pod.

Of course, you aren't limited to reference models. When you build your own models, you build them with TensorFlow, and when you build models with TensorFlow, there are really two things that you have to think about. There is the thing that most people focus their energy on, which is the network architecture itself, which runs on the accelerator. But what a lot of people neglect is the input pipeline: reading the training data, decompressing it, parsing it, performing data augmentation, batching it, and then sending it into the accelerators. A lot of people don't think about this as a problem, but for these high performance accelerators, this is what limits performance, because if your input pipeline is slow, then the accelerator is just idle half the time.

So phase one, build an input pipeline. Here is a very simple input pipeline for ResNet-50: you have an input function, you list a bunch of files, you shuffle them, you repeat them, and then you send them out. So this is great. Guess what the performance of this is. It's 150 images per second. So even if you run this on a Cloud TPU, you're getting 150 images per second for training, which is not great, because Cloud TPUs can do 4,000 images per second. You have a bottleneck.

So how do you improve performance? You find the bottleneck, you optimize the bottleneck, and you repeat until you get the desired performance. Cloud TPUs actually provide a fairly comprehensive set of profiling tools. In this case, this is TensorBoard: you can bring up a profile of what's happening on your TPU, on the host, and so on and so forth. And then you can see that, oh, there are large gaps, which means the TPU is idle waiting for data. And that's not great.

A simplified representation of what's happening on TensorBoard right now is something like this. We have an extract, we have a transform, we have a load, and then we have the training on the accelerator, and they are all happening sequentially. This is not great, because you're leaving the CPU idle and you're leaving the accelerator idle, and those two things are the biggest cost factors in your training pipeline. What you really want to do is something like this: you overlap every single step, and you utilize all of the expensive parts of your computer to the fullest extent. The accelerator is 100% utilized, the CPU is only slightly idle, and the disk is idle, but that's fine.

And doing this pipelining is really easy. You just have to modify one thing: you see the second-to-last line, you just add dataset.prefetch, and this ensures that everything above it is pipelined with the accelerator training. And of course, you also want to do parallel reads, because reading from many files is faster than reading from one. There are many other techniques that I won't go into today because I don't have time, such as sloppy interleave and fused dataset operators. We have a good performance guide on the TensorFlow website that tells you how you can optimize your input pipelines, and I encourage you to take a look. This is a partially optimized input pipeline; it's slightly longer than our simple one, but it does over 2,000 images per second. And if you want the fully optimized input pipeline, please take a look at our TPU sample code.
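For reference, here's roughly what such a pipeline looks like with the tf.data API in TensorFlow 1.x. The GCS path, the parsing function, and the tuning values are placeholders of my own, not the speaker's code; the real reference-model pipeline lives in the TPU sample code mentioned above.

```python
# A rough sketch of the kind of input pipeline described above (TF 1.x tf.data API).
# The gs:// path, parse_fn, and the tuning values are placeholders.
import tensorflow as tf

def parse_fn(serialized_example):
    # Placeholder record parser: decode a tf.Example into (image, label).
    features = tf.parse_single_example(
        serialized_example,
        {'image': tf.FixedLenFeature([], tf.string),
         'label': tf.FixedLenFeature([], tf.int64)})
    image = tf.image.decode_jpeg(features['image'], channels=3)
    image = tf.image.resize_images(image, [224, 224])
    return image, features['label']

def input_fn(params):
    batch_size = params['batch_size']   # TPUEstimator passes the batch size in here
    files = tf.data.Dataset.list_files('gs://my-bucket/train-*')  # hypothetical path
    # Parallel (sloppy) reads: pull records from many files at once.
    dataset = files.apply(tf.contrib.data.parallel_interleave(
        tf.data.TFRecordDataset, cycle_length=8, sloppy=True))
    dataset = dataset.shuffle(buffer_size=1024).repeat()
    dataset = dataset.map(parse_fn, num_parallel_calls=64)
    dataset = dataset.batch(batch_size, drop_remainder=True)
    # The key line: overlap everything above with training on the accelerator.
    return dataset.prefetch(buffer_size=2)
```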
OK, cool. Now comes the fun part: building your model. The first way you can build your model is with Keras. We have experimental Keras integration available starting with TensorFlow 1.11, which will be coming out in about two to three weeks. You can write your models in Keras as per normal, and the only real thing that you have to modify is to create what's called a cluster resolver, give it the name of your TPU, create a distribution strategy, and call the keras_to_tpu_model function. This will transform your model into something that's compatible with the TPU. After that, you can just do the usual model.compile, model.fit, and all the Keras goodness that you know and love. And in TensorFlow 1.12, which is the release after this, we are going to make it even easier: you won't even have to call keras_to_tpu_model anymore. You will just call model.compile directly, and it will work.
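Here's a sketch of that flow, assuming TensorFlow 1.11 and a Cloud TPU named "my-tpu"; the TPU name and the tiny model are placeholders, not the speaker's example.

```python
# A sketch of the experimental Keras-on-TPU flow described above, assuming
# TensorFlow 1.11 and a Cloud TPU named "my-tpu" (name and model are placeholders).
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax'),
])

resolver = tf.contrib.cluster_resolver.TPUClusterResolver(tpu='my-tpu')
strategy = tf.contrib.tpu.TPUDistributionStrategy(resolver)
tpu_model = tf.contrib.tpu.keras_to_tpu_model(model, strategy=strategy)

tpu_model.compile(optimizer=tf.train.AdamOptimizer(),
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
# tpu_model.fit(train_images, train_labels, batch_size=1024, epochs=5)
```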
Great. Say you don't want to use Keras; you want something lower level. We have a solution for that, too: you can use TensorFlow Distribution Strategy. I think there was a talk about Distribution Strategy yesterday, so if you missed that, the video should be online soon, and you should take a look. In this case, this is using the Estimator version of Distribution Strategy. You write your model function like you see on the left, and you write your input function like you see on the top right. Again, the only thing you really have to modify is a couple of lines: create a cluster resolver, create a TPU strategy, and then pass it to the Estimator through the train_distribute argument of its run config. This will let it work on TPUs.

So that's all great. Are people actually using these TPUs? They are, in fact. Here's a case study of an architecture search project done by a group from Stanford and MIT. They did parallel runs using hundreds and hundreds of Cloud TPUs from the TensorFlow Research Cloud program, which is where we are providing 1,000 free TPUs to academic researchers. So if you're an academic researcher, I encourage you to look into this program. Each blue dot in this image is a run on a TPU training an ImageNet-scale convolutional RNN. Each run used to take hours and hours to train on other hardware, but because they had access to so many TPUs, they could do hundreds and hundreds of these runs. What they were trying to do was search for a model that was a better fit for the data that you record, say, if you put electrodes in my brain and look at what my visual cortex is doing when I look at things. They were trying to find a neural network that was a closer analog to the primate visual cortex. Here's a diagram of the space that they were searching, and it turns out that across a population of many different models, the red connections were the ones selected for by the search. They then went back and compared the models to some of the signals the biologists were recording, and they found that the convolutional RNNs were a much better fit for neural signals, for instance in V4 and IT, than other kinds of models, like the convolutional or feed-forward models that you see in the literature today. So this is a really new and exciting direction that a research group was able to pursue from scratch with access to lots of compute. You can not just train models on TPUs, you can search for them basically automatically, too.

And so, finally, Cloud TPU version 2 is generally available on Google Cloud today. If you want to learn more, go to cloud.google.com/tpu to get started. All right. Now Alex will present some new functionality that lets you write accelerator code more easily. Alex.