[MUSIC PLAYING] MARTIN GORNER: Hi, everyone, and thank you for being here at 8:30 in the morning, and welcome to this session about TPUs and TPU pods. Those are custom-made accelerators that Google has designed to accelerate machine learning workloads. And before Kaz and I tell you everything about them, I would like to do something. Of course, this is live, so you want to see a live demo. And I would like to train with you here, onstage, using a TPU pod, one of those big models that used to take days to train. And we'll see if we can finish the training within this session. So let me start the training. I will come back to explaining exactly what I'm doing here. I'm just starting it. Run all cells. Seems to be running. OK, I'm just checking. I'm running this on a 128-core TPU pod. So one of the things you see in the logs here is that I have all my TPUs appearing: 0, 1, 2, 6, and all the way down to 128. All right, so this is running. I'm happy with it. Let's hear more about TPUs. So first of all, what is this piece of silicon? And this is the demo that I've just launched. It's an object detection demo that is training on a wildlife data set of 300,000 images. Why wildlife? Because I can show you cute pandas. And I can show you cute electronics as well. So this is a TPU v2. And we have a second version now, a TPU v3. Those are fairly large boards. It's large like this, roughly. And as you can see, they have four chips on them. Each chip is dual core, so each of these boards has 8 TPU cores on it. And each core has two units. One is a vector processing unit, which is a fairly standard general-purpose, data-oriented processor. What makes this special for machine learning is the second unit, the matrix multiply unit. TPUs have a built-in, hardware-based matrix multiplier that can multiply two 128 by 128 matrices in one go. So what is special about this architecture? There are two tricks that we used to make it fast and efficient. The first one is, I would say, semi-standard. It's reduced precision. When you train neural networks, reducing the precision from 32-bit floating point to 16-bit is something that people quite frequently do, because neural networks are quite resistant to the loss of precision. Actually, it even happens sometimes that the noise introduced by reduced precision acts as a kind of regularizer and helps with convergence. So sometimes you're even lucky when you reduce precision. But then, as you see on this chart, float16 and float32, the floating point formats, don't have the same number of exponent bits, which means that they don't cover the same range. So when you take a model and downgrade all your float32s into float16s, you might get into underflow or overflow problems. And if it is your model, it's usually not so hard to go in and fix. But if you're using code from GitHub and you don't know where to fix stuff, this might be very problematic. So that's why on TPUs we chose a different-- actually, we designed a different floating point format called bfloat16. And as you can see, it's exactly the same as float32 with just the fractional bits cut off. So the point is it has exactly the same number of exponent bits, exactly the same range. And therefore, usually, it's a drop-in replacement for float32 at reduced precision. So typically for you, there is nothing to do on your model to benefit from the speed of reduced precision. The TPU will do this automatically, on chip, in hardware.
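To make that bfloat16 point concrete, here is a small illustrative sketch, plain NumPy rather than TPU code, and the helper name is made up for illustration: bfloat16 keeps float32's sign bit and 8 exponent bits and simply drops the low 16 fraction bits, so the range is preserved while the precision shrinks.

```python
import numpy as np

# Illustrative only: emulate bfloat16 by zeroing the low 16 bits of a float32.
# Sign and exponent bits are untouched, so the representable range stays the same.
def truncate_to_bfloat16(x: float) -> float:
    bits = np.array(x, dtype=np.float32).view(np.uint32)  # raw float32 bit pattern
    truncated = bits & np.uint32(0xFFFF0000)              # keep sign + exponent + top 7 fraction bits
    return truncated.view(np.float32).item()

print(truncate_to_bfloat16(3.14159265))  # ~3.140625: same magnitude, fewer significant digits
print(truncate_to_bfloat16(1e38))        # still finite: the exponent range did not change
```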
And the second trick is architectural. It's the design of this matrix multiply unit. So that you understand how this works, try to picture, in your head, how to perform a matrix multiplication. One result, one point of the resulting matrix-- try to remember your math from school-- is a dot product: a dot product of one line of the first matrix and one column of the second matrix. Now what is a dot product? A dot product is a series of multiply-accumulate operations, which means that the only operation you need to perform a matrix multiplication is multiply and accumulate. And a multiply-accumulate in 16 bits, because we're using bfloat16 reduced precision, is a tiny, tiny piece of silicon. A 16-bit multiply-accumulator is a tiny piece of silicon. And you wire them together as an array, as you see here. In real life this would be a 128 by 128 array. It's called a systolic array-- systolic because the data is pumped through it in waves, like blood through a heart. You flow the data through it. So the way it works is that you load one matrix into the array, and then you flow the second matrix through the array. And you'll have to believe me, or maybe spend a little bit more time with the animation: by the time the gray dots have finished flowing through those multiply-accumulators, out of the right side come all the dot products that make up the resulting matrix. So it's a one-shot operation. There are no intermediate values to store anywhere, in memory or in registers. All the intermediate values flow on the wires from one compute unit to the next. It's very efficient. And what is more, it's only made of those tiny 16-bit multiply-accumulators, which means that we can cram a lot of them into one chip. 128 by 128 is 16,384 multiply-accumulators. And that's how many you get in one TPU core, and twice that in the two cores of a chip. So this is what makes it dense. Density means power efficiency. And power efficiency in the data center means cost. And of course, you want to know how cheap or how fast these things are. Some people might remember that last year I did a talk about this planespotting model I built, so I'm using it as a benchmark today. And on Google Cloud's AI Platform, it's very easy to get different configurations, so I can test how fast this trains. My baseline: on a fast GPU, this model trains in four and a half hours. But I can also get five machines with powerful GPUs in a cluster. And on those five machines, five GPUs, this model will train in one hour. And I've chosen this number because one hour is exactly the time it takes for this model to train on one TPU v2. So the rule of thumb I want you to remember is that one TPU v2, with its four chips, is roughly as fast as five powerful GPUs. That's in terms of speed. But as you can see, it's almost three times cheaper. And that's the point of optimizing the architecture specifically for neural network workloads. You might want to know how this works in software as well. So when you're using TensorFlow, or Keras in TensorFlow, your TensorFlow Python code generates a computational graph. That is how TensorFlow works. Your entire neural network is represented as a graph. Now, this graph is what is sent to the TPU. The TPU does not execute Python code. The graph is processed through XLA, the Accelerated Linear Algebra compiler, and that is how it becomes TPU microcode to be executed on the TPU.
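As a tiny illustration of that graph point, here is a generic TF 1.x-style sketch, nothing TPU-specific: the Python code only builds symbolic operations, and it is this graph that a backend such as XLA can then compile.

```python
import tensorflow as tf

# Build a graph explicitly (TF 1.x style); nothing is computed while these lines run.
graph = tf.Graph()
with graph.as_default():
    a = tf.constant([[1.0, 2.0]])
    b = tf.constant([[3.0], [4.0]])
    c = tf.matmul(a, b)

# 'c' is a symbolic tensor describing an operation in the graph, not a value.
# The graph as a whole is what gets compiled and executed on the accelerator.
print(c)  # e.g. Tensor("MatMul:0", shape=(1, 1), dtype=float32)
```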
And one nice side effect of this architecture is that if, in your TensorFlow code, you load your data through the standard tf.data.Dataset API, as you should, and as is required with TPUs, then even the data loading part, or image resizing, or whatever is in your data pipeline, ends up in the graph and ends up executed on the TPU. And the TPU will be pulling data from Google Cloud Storage directly during training. So that is very efficient. How do you actually write this in code? Let me show you in Keras. And one caveat: this is Keras in TensorFlow 1.14, which should be out in the next few days. The API is slightly different in TensorFlow 1.13 today, but I'd rather show you the one that will be-- the new one, as of tomorrow or next week. So it's only a couple of lines of code. There is the first line, TPUClusterResolver. You can call it without parameters on most platforms, and that finds the connected TPU. The TPU is a remotely connected accelerator; this finds it. You initialize the TPU, and then you use the new distribution API in TensorFlow to define a TPUStrategy based on this TPU. And then you say with strategy.scope, and everything that follows is perfectly normal Keras code. Then you define your model, you compile it, you do model.fit, model.evaluate, model.predict-- anything you're used to doing in Keras. So in Keras, it's literally these four lines of code to add to work on a TPU. And I would like to point out that these four lines of code also transform your model into a distributed model. Remember, a TPU, even a single TPU, is a board with eight cores. So from the get-go it's distributed computing. And these four lines of code put in place all the machinery of distributed computing for you. One parameter to notice: you see in the TPUStrategy, there is steps_per_run equals 100. That's an optimization. This tells the TPU, please run 100 batches worth of training and don't report back until you're finished. Because it's a network-attached accelerator, you don't want the TPU to be reporting back after each batch, for performance reasons. So this is the software.
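For reference, here is roughly what those four lines look like, sketched against the TensorFlow 1.14-era API described above; module paths moved around between releases, so treat the exact names as approximate rather than definitive.

```python
import tensorflow as tf

# Find the network-attached TPU, initialize it, and build a TPUStrategy.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()   # locates the connected TPU
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)
# In the 1.14 API, TPUStrategy also accepted steps_per_run=100 so the TPU only
# reports back to the host every 100 batches.

with strategy.scope():
    # Everything inside the scope is ordinary Keras code.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

# model.fit(...), model.evaluate(...) and model.predict(...) then work as usual.
```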
If you don't want to write your own code-- I encourage you to do so, but if you don't-- we have a whole library of TPU-optimized models. You will find them on the TensorFlow/tpu GitHub repository. And there is everything in the image-- in the vision space, in the machine translation, language, and NLP space, in speech recognition. You can even play with GAN models. The one that we are demoing on stage-- remember, we are training the model right now-- is RetinaNet. So this one is an object detection model. And I like this model, so let me say a few words about how it works. In object detection, you put in an image, and what you get is not just a label-- this is a dog, this is a panda-- but you actually get boxes around where those objects are. In object detection models, you have two kinds. There are one-stage detectors that are usually fast but kind of inaccurate, and then two-stage detectors that are much more accurate but much slower. And I like RetinaNet because they actually found a trick to make it both the fastest and the most accurate model that you can find in object detection today. And it's a very simple trick. I'm not going to explain all the math behind it, but basically in these detection models, you start with candidate detections, and then you prune them to find only the detections-- the boxes that have actual objects in them. And the thing is that all those blue boxes that you see, there is nothing in them. So even during training, they will very easily be classified as "nothing to see, move along" boxes, with a fairly small error. But you've got loads of them, which means that when you compute the loss of this model, in the loss you have a huge sum of very small errors. And that huge sum of very small errors might in the end be very big and overwhelm the useful signal. So two-stage detectors resolve that by being much more careful about those candidate boxes. In one-stage detectors, you start with a host of candidate boxes. And the trick they found in RetinaNet is a little mathematical trick on the loss to make sure that the contribution of all those easy boxes stays small. The upshot: it's both fast and accurate. So let me go back here. I actually want to say a word about what I did, exactly, when I launched this demo. I guess most of you are familiar with the Google Cloud Platform. So here I am opening the Google Cloud Platform console. And in the Google Cloud Platform, I have a tool called AI Platform, which, for those who know it, has had a facility for running training jobs and for deploying models behind a REST API for serving. But there is a new functionality called Notebooks. In AI Platform, you can today provision ready-made, fully installed notebooks for working in-- yeah, so let me switch to this one-- for working either in TensorFlow or in PyTorch, with GPUs. It's literally a one-click operation. NEW INSTANCE, I want a TensorFlow instance with Jupyter notebook installed, and what you get here is an instance that is running, with a link to open Jupyter. For example, this one-- and it will open Jupyter, but it's already open. So it's asking me to select something else, but it's here. And here, you can actually work normally in your Jupyter environment with a powerful accelerator. You might have noticed that I don't have a TPU option-- actually not here, but here, for adding an accelerator. That's coming. But here I am using Jupyter notebook instances that are powered by a TPU v3 128-core pod. How did I do it? It's actually possible on the command line. I give you the command lines here. There is nothing fancy about it. There is one gcloud compute command line to start the instance and a second gcloud compute command line to start the TPU. You provision a TPU just as you would a virtual machine in Google's cloud. So this is what I've done. And that is what is running right now. So let's see where we are. Here, it's still running. As you see, "enqueue next 100 batches." And it's training. We are at step 4,000 out of roughly 6,000. So we'll check back on this demo at the end of the session. When I was preparing this demo to run it on stage, I was also able to run a comparison of how fast TPU v3s are versus v2s. In theory, v3s are roughly twice as powerful as v2s, but that only works if you feed them enough work to make use of all the hardware. So here on RetinaNet, you can train on images of various sizes. Of course, if you train on smaller images, 256-pixel images, it will be much faster in terms of images per second. And I've tried both TPU v2s and v3s. You see with small images, you get a little bump in performance from TPU v3s, but nowhere near double. But as you get to bigger and bigger images, you are feeding the hardware with more work. And on 640-pixel images, the speed-up you get from TPU v3 is getting close to the theoretical 2x factor.
So for this reason, I am running this demo here at the 512-pixel image size on a TPU v3 pod. I'm talking about pods. But what are these pods, exactly? To show you more about TPU pods, I would like to give the lectern to Kaz. Thank you, Kaz. KAZ SATO: Thank you, Martin. [APPLAUSE] So in my part, I will introduce Cloud TPU pods. What are pods? A pod is a large cluster of Cloud TPUs. The version two pod is now available as public beta, which provides 11.6 petaflops with 512 TPU cores. The next-generation version three pod is also public beta now, which achieves over 100 petaflops with 2,048 TPU cores. Those performance numbers are as high as the greatest supercomputers. So Cloud TPU pods are AI supercomputers that Google has built from scratch. But some of you might think, what's the difference between a bunch of TPU instances and a Cloud TPU pod? The difference is the interconnect. Google has developed ultra-high-speed interconnect hardware, derived from supercomputer technology, for connecting thousands of TPUs with very short latency. What does it do for you? As you can see in the animation, every time you update a single parameter on a single TPU, it will be synchronized with all the other thousands of TPUs, in an instant, by the hardware. So in short, TensorFlow users can use the whole pod as a single giant machine with thousands of TPU cores inside it. It's as easy as using a single computer. And because it's an AI supercomputer, you may think it also comes at a super-high cost. But it does not. You can get started using TPU pods with 32 cores at $24 per hour, without any initial cost. So you don't have to pay millions of dollars to build your own supercomputer from scratch. You can just rent it for a couple of hours from the cloud. A version three pod can also be provisioned with 32 cores. That costs only $32 per hour. For larger sizes, you can ask our sales contacts for pricing. What is the cost benefit of TPU pods over GPUs? Here's a comparison result. With a full version two pod, with 512 TPU cores, you can train the same ResNet-50 model 27 times faster at 38% lower cost. This shows the clear advantage of TPU pods over typical GPU-based solutions. And there are other benefits you can get from TPU pods. Let's take a look at eBay's case. eBay has over 1 billion product listings. And to make it easier to search for specific products among those 1 billion products, they built a new visual search feature. And to train the models, they used 55 million images. So it's really large-scale training for them. They used Cloud TPU pods, and eBay was able to get 100 times faster training compared with their existing GPU service. And they also got a 10% accuracy boost. Why is that? The TPU itself is not designed to increase accuracy that much. But if you can increase the training speed 10 times or 100 times, that means the data scientists or researchers can have 10 times or 100 times more iterations for their trials, such as trying out different combinations of hyperparameters or different preprocessing, and so on. So that ended up as at least a 10% accuracy boost in eBay's case. Let's see what kind of TensorFlow code you would write to get those benefits from TPU pods. And before taking a look at the actual code, let me look back: what effort was required, in the past, to implement large-scale distributed training? Using many GPUs or TPUs for a single machine learning training run-- that is so-called distributed training.
And there are two ways. One is data parallelism and the other is model parallelism. Let's talk about data parallelism first. With data parallelism, as you can see in the diagram, you have to split the training data across multiple GPU or TPU nodes. And you also have to share the same parameter set, the model. And to do that, you have to set up a cluster of GPUs or TPUs by yourself. You also have to set up a parameter server that shares all the parameter updates among all the GPUs or TPUs. So it's a complex setup. And in many cases, there is going to be synchronization overhead. So if you have hundreds or thousands of TPUs or GPUs in a single cluster, that's going to be a huge overhead, and that limits the scalability. But with TPU pods, the hardware takes care of it. The high-speed interconnect synchronizes all the parameter updates on a single TPU with the other thousands of TPUs in an instant, with very short latency. So there's no need to set up a parameter server, and no need to set up a large cluster of GPUs by yourself. And you get almost linear scalability as you add more TPU cores to your training. Martin will show you the actual scalability results later. And as I mentioned earlier, TensorFlow users can use the whole TPU pod as a single giant computer with thousands of TPU cores inside it. So it's as easy as using a single computer. For example, if you have Keras code running on a single TPU, it also runs on 2,000 TPU cores without any changes. This is exactly the same code Martin showed earlier. Under the hood, all the complexity of data-parallel training, such as splitting the training data across the multiple TPUs or sharing the same parameters, is taken care of by the TPU pod's interconnect, the XLA compiler, and the new TPUStrategy API in TensorFlow 1.14. The one thing you may want to change is the batch size. As Martin mentioned, a TPU core has a matrix processor with a 128 by 128 matrix multiplier. So usually, you will get the best performance by setting the batch size to 128 times the number of TPU cores. So if you have 10 TPU cores, that's going to be 1,280.
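As a quick illustration of that rule of thumb, here is the arithmetic spelled out, using the per-core batch size of 128 from the talk:

```python
# Rule of thumb from the talk: a per-core batch of 128 keeps the 128x128 matrix unit busy.
PER_CORE_BATCH = 128

def global_batch_size(num_tpu_cores: int) -> int:
    return PER_CORE_BATCH * num_tpu_cores

for cores in (8, 10, 32, 128, 512):
    print(cores, "cores ->", global_batch_size(cores))
# 8 cores (one board) -> 1024, 10 -> 1280, 32 -> 4096,
# 128 -> 16384, 512 (a full v2 pod) -> 65536
```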
The benefit of TPU pods is not only faster training times. They also enable the training of giant models by using Mesh TensorFlow. Data parallelism has been a popular way of doing distributed training, but there's one downside: it cannot train a big model. Because all the parameters are shared across all the GPUs or TPUs, you cannot train a big model that doesn't fit into the memory of a single GPU or TPU. So there's another way of distributed training, called model parallelism. With model parallelism, you can split a giant model across multiple GPUs or TPUs, so that you can train much larger models. But that has not been a popular way. Why? Because it's much harder to implement. As you can see in the diagrams, you have to implement all the communication between the fractions of the model. It's like stitching the pieces of the model together. And again, you have to set up a complex cluster, and in many cases the communication between the model pieces becomes a problem, because if you have hundreds or thousands of CPU, GPU, or TPU cores, that's going to be a huge overhead. Those are the reasons why model parallelism has not been so popular. To solve those problems, the TensorFlow team has developed a new library called Mesh TensorFlow. It's a new way of distributed training with multiple computing nodes, such as TPU pods, multiple GPUs, or multiple CPUs. Mesh TensorFlow provides an abstraction layer that sees those computing nodes as a logical n-dimensional mesh. Mesh TensorFlow is now available as open-source code on the TensorFlow GitHub repository. To see how it works, imagine you have a simple neural network like this for recognizing MNIST images. This network has a batch size of 512, a data dimension of 784, one hidden layer with 100 nodes, and an output of 10 classes. And if you want to train that network with model parallelism, you can just tell Mesh TensorFlow, "I want to split the parameters across four TPUs," and that's it. You don't have to think about how to implement the communication between the pieces of the split model or worry about the communication overhead. What kind of code would you write? Here is the code to use model parallelism. At first, you define the dimensions of both the data and the model. In this code, you are defining the batch dimension as 512, the data dimension as 784, the hidden layer with 100 nodes, and 10 classes. And then you define your own network by using the Mesh TensorFlow APIs-- two sets of weights, one hidden layer, the logits, and a loss function-- using those dimensions. Finally, you define how many TPUs or GPUs you have in the mesh, and what layout rule you want to use. In this code example, it is using the hidden-layer dimension for splitting the model parameters across the four TPUs. And that's it. Mesh TensorFlow can take this code and automatically split the model parameters across the four TPUs. And it shares the same training data with all the TPUs. You can also combine both data and model parallelism. For example, you can define a 2D mesh like this, and use the rows of the mesh for data parallelism and the columns of the mesh for model parallelism, so that you can get the benefits of both. And again, it's easy to define with Mesh TensorFlow: you just specify the batch dimension for the rows and the hidden-layer dimension for the columns. This is an example where you are using Mesh TensorFlow for training a transformer model. The transformer is a very popular language model, and I won't go deeper into it. But as you can see, it's easy to map each layer of a transformer model to the layout rules of Mesh TensorFlow, so that you can efficiently map the large data and large model onto hundreds or thousands of TPU cores. So what's the benefit? By using Mesh TensorFlow running on TPU pods, the Google AI team was able to train language models and translation models at the billion-word scale. And they were able to achieve state-of-the-art scores, as you can see from those numbers. So for those use cases, the larger the model, the better the accuracy you get. Model parallelism with TPU pods gives a big advantage in achieving those state-of-the-art scores.
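To ground that description, here is a rough sketch of what defining the dimensions and a layout rule looks like, written in the style of the Mesh TensorFlow examples; exact call signatures may differ between versions, and the mesh name below is made up for illustration.

```python
import mesh_tensorflow as mtf

# Define the logical graph and mesh, then name the dimensions from the MNIST-like example.
graph = mtf.Graph()
mesh = mtf.Mesh(graph, "my_mesh")
batch_dim = mtf.Dimension("batch", 512)
io_dim = mtf.Dimension("io", 784)
hidden_dim = mtf.Dimension("hidden", 100)
classes_dim = mtf.Dimension("classes", 10)

# Two weight matrices expressed over those named dimensions.
w1 = mtf.get_variable(mesh, "w1", [io_dim, hidden_dim])
w2 = mtf.get_variable(mesh, "w2", [hidden_dim, classes_dim])

# Model parallelism is just a layout choice: split the "hidden" dimension
# across a 4-processor mesh; Mesh TensorFlow inserts the communication for you.
mesh_shape = [("all_processors", 4)]
layout_rules = [("hidden", "all_processors")]
# For data parallelism instead (or in addition, on a 2D mesh), you would map
# the "batch" dimension to a mesh dimension in the same way.
```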
Let's take a look at another use case of large-scale model parallelism, called BigGAN. I won't go deeper into what a GAN is or how it works, but here's the basic idea. You have two networks. One is called the discriminator, D, and the other is called the generator, G. And you define a loss function so that D is trained to recognize whether an image is a fake image or a real image. And at the same time, the generator is trained to generate an image realistic enough that D cannot tell it's a fake. It's like a minimax game you are playing with those two networks. And eventually, you will have a generator G that can generate photo-realistic fake images-- artificial images. Let's take a look at the demo video. So this is not a big spoiler: I have already loaded the BigGAN model that was trained on the TPU pod. And as you can see, these are all artificial, synthesized images of high quality. You can also specify the category of the generated images, such as ostrich, so that you can generate ostrich images. These are all synthesized, artificial images. None of them are real. And because BigGAN has a so-called latent space that holds the seeds used to generate those images, you can interpolate between two seeds. In this example, it is interpolating between a golden retriever and a Lhasa Apso. And you can try out different combinations for the interpolation, such as a West Highland white terrier and a golden retriever. Again, those are all fake images. So this BigGAN model was trained with a TPU version three pod with 512 cores, and that took 24 to 48 hours. Why does BigGAN take so many TPU cores and such a long time? The reasons are the model size and the batch size. The quality of a GAN model is measured by the inception score, or IS score. That represents how much an Inception model thinks those images are real, and it also represents the variety of the generated images. The BigGAN paper says that you get a better IS score when you have more parameters in the model and when you use a larger batch size for training. So that means larger-scale model parallelism on hundreds of TPU cores is crucial for the BigGAN model to increase the quality of those generated images. So we have seen two use cases: the BigGAN use case and the language model use cases. Those are the first applications of model parallelism on TPU pods, but they are only the start. TPU pods are available to everyone from now on, so we expect to see more and more exciting use cases coming from new TPU pod users and their applications. So that's it for my part. Back to Martin. MARTIN GORNER: So now it's time to check on our demo. Did our model actually train? Checking here-- yeah, it looks like it has finished training. A saved model has been saved. So the only thing left to do is to verify that this model can actually predict something. So on a second machine, I will reload the exact same model. OK. I believe that's the one. And let's go and reload it. So I'll skip training this time and just go here to inference and loading. Whoops, sorry about that. I just hope the demo gods will be with me today. All right. That's because I'm loading the wrong directory. The demo gods are almost with me. It's this one, where my model has been saved. All right. Yes. Indeed. It wasn't the same. Sorry about that. No training, just inference. And this time, it looks like my model is loading. And once it's loaded, I will see if it can actually detect animals in images, and here we are. So this leopard is actually a leopard. This bird is a bird. The lion is a lion. This is a very tricky image. So I'm showing you not cherry-picked images. This is a model I have trained on stage, here with you. No model is perfect. We will see bad detections, like this one. But that's a tricky one. It's artwork. It's not an actual lion. The leopard is spot on. The lion is spot on.
And you see that the boxing actually works very well. The leopard has been perfectly identified in the image. So let's move to something more challenging. Even this inflatable artwork lion has been identified, which is not always the case. This is a complicated image-- a flock of birds. So you see it's not seeing all of them. But all of them, at least, are birds, which is a pretty good job. The leopard is fine. Oh, and this is the most complex one we have. There is a horse and cattle. Well, we start seeing a couple of bad detections here. Of course, that cow is not a pig. As I said, no model is perfect. But here the tiger is a tiger, and we have our two cute pandas. And those two cute pandas are actually quite difficult, because they are baby pandas. And I don't believe that this model has had a lot of baby animals in its 300,000-image data set. So I'm quite glad that it managed to find the two pandas. So moving back, let me finish by giving you a couple of feeds and speeds on those models. So here, this model has a ResNet-50 backbone, plus all the detection layers that produce the boxes. And we have been training it on a TPU v3 pod with 128 cores. It did finish in 20 minutes. You don't have to just believe me on that. Let me show you: here I had a timer in my script. Yep, 19 minutes and 18 seconds. So I'm not cheating. This was live. But I could also have run this model on a smaller pod. Actually, I tried on a TPU v2-32. On this chart, you see the speed on this axis and the time on this axis. This is to show you that a TPU v2-32 is actually a very useful tool to have. We've been talking about huge models up to now. But it's debatable whether this is a huge model. This definitely was a huge model a year ago. Today, with better tools, I can train it in an hour on a fairly modest TPU v2 32-core pod. So even as an individual data scientist, that is a very useful tool to have handy when I need to do a round of trainings on a model like this, because someone wants an animal detection model. And bringing the training down to the one-hour space, or the 20-minute space, allows me to work a lot faster and iterate a lot faster on the hyperparameters, on the fine-tuning, and so on. You see a single TPU v3-- it's the bottom line. And if we were to train this on a GPU-- remember our rule of thumb from the beginning: one TPU v2 is roughly five GPUs, therefore one TPU v3 is roughly 10 GPUs. So the GPU line would be one tenth of the lowest line on this graph. I didn't put it in because it would barely register there. That shows you the change of scale at which you can be training your models using TPUs. You might be wondering about this. So as you scale, one thing that might happen is that you have to adjust your learning rate schedule. This is actually the learning rate schedule I used to train the model on the 128-core TPU pod. Just a couple of words, because it might not be the most usual learning rate schedule you have ever seen. There is this ramp-up, and then the second part is exponential decay. That's fairly standard. But the ramp-up part-- that is because we are starting from ResNet-50 initialized with pre-trained weights. But we still leave those weights trainable. So we are training the whole thing. It's not transfer learning. It's just fine-tuning of a pre-trained ResNet-50. And when you do that, and you train very fast, using big batches, as we do here-- the batch size here is 64 times 128. So it's a very big batch size.
You might actually break those pre-trained weights in ways that harm your precision. So that's why it's quite usual to have a ramp-up period, to make sure that the network, in its initial training phases, when it doesn't know what it's doing, does not completely destroy the information in the pre-trained weights. So we did it. We did train this model here on stage in 20 minutes. And the demo worked-- I'm really glad about that. So this is the end. What we have seen is TPUs and TPU pods. Fast, yes, but mostly cost-effective. A very cost-effective way of training and a good tool to have for any data scientist. Also, and more specifically, for very large models, but also for what used to be large models in the past and which are normal models today, such as a ResNet-50 [INAUDIBLE], it's a very useful tool. And then Cloud TPU pods, where you can actually enable not only data, but model parallelism, using this new library called Mesh TensorFlow. A couple of links here with more information if you would like to know more. Yes, you can take a picture. And if you have more questions, we will be at the AI/ML pod, the red one, in front of a TPU rack. So you can see one live and get a feel for what kind of computer it is. And with that, thank you very much. [APPLAUSE] [MUSIC PLAYING]