LUKASZ KAISER: Hi, my name is Lukasz Kaiser, and I want to tell you in this final session about Tensor2Tensor, which is a library we've built on top of TensorFlow to organize the world's models and data sets. So I want to tell you about the motivation, how it came together, and what you can do with it. If you have any questions in the meantime, just ask at any time. And if you've already used Tensor2Tensor, you might have even more questions.

The motivation behind this library is this: I am a researcher in machine learning, and I've also worked on putting models into production, and research can be very annoying. It can be very annoying to researchers, and it's even more annoying to the people who put it into production, because research works like this. You have an idea. You want to try it out. It's machine learning, and you think, well, I will change something in the model and it will be great. It will solve physics problems, or translation, or whatever. So you have this idea, and it seems so simple -- I just need to change one thing. But then, OK, I need to get the data. Where was it? So you search online, you find it, and then, well, I need to preprocess it. You implement some data reading. You download the model that someone else built. And it doesn't give the result that someone else reported in the paper at all. It's worse. It runs 10 times slower. It doesn't train at all. So you start tweaking it. It turns out someone else had a script that preprocessed the data in a certain way that improved the model 10 times, so you add that. Then it turns out your input pipeline is not performant, because it doesn't put the data on the GPU or CPU the right way, so you tweak that. Before you even start on your research idea, you've spent half a year just reproducing what's been done before.

So then, great, you do your idea. It works. You write the paper. You submit it. You put the code in a repo on GitHub, with a README file that says, well, I downloaded the data from there -- but that link was already gone two days after you made the repo. And you describe all the 17 tweaks you applied, but maybe you forgot the one option that was crucial. And then there is the next paper and the next piece of research, and the next person comes and does the same.

So it's all great, except at some point the production team says, well, we should put this into production, it's a great result. And then they need to retrace this whole path, redo all of it, and try to get the same numbers. So it's a very difficult state of the world. And it's even worse because there are different hardware configurations. Maybe something that trained well on a CPU does not train on a GPU, or maybe you need an 8-GPU setup, and so on and so forth.

So the idea behind Tensor2Tensor was: let's make a library that has at least a bunch of standard models for standard tasks, including the data and the preprocessing, so that you really can, on the command line, just say, please get me this data set and this model, and train it -- and make it so that we can have regression tests and actually know that it will train, that it will not break with TensorFlow 1.10, and that it will train on a GPU, on a TPU, and on a CPU. To have it in a more organized fashion.

And the thing that prompted Tensor2Tensor, the reason I started it, was machine translation. I worked with the Google Translate team on launching neural networks for translation.
And this was two years ago, and it was amazing work, because before that, machine translation was done with what's called phrase-based machine translation: you find alignments of phrases, you translate the phrases, and then you try to reorder things to make the sentence work. Results in machine translation are normally measured in terms of something called the BLEU score. I won't go into the details of what it is -- the higher the better. For example, for English-German translation, the BLEU score that human translators get is about 30. And the best phrase-based -- so non-neural-network, non-deep-learning -- systems were at about 20 or 21, after at least a decade of research, maybe more. When I was doing my PhD, if you got one BLEU point up, you would be a star. That was a good PhD. Going from 21 to 22 would be amazing.

So then the neural networks came. The early LSTMs in 2015 were at something like 19.5 or 20. And we talked to the Translate team, and they said, you know, guys, it's fun, it's interesting, because it's simpler in a way. You just train the network on the data. There's no language-specific stuff. It's a simpler system. But it gets worse results, and who knows if it will ever get better. But then the neural network research moved on, and people started getting 21, 22. So the Translate team, together with Brain, where I work, made a big effort to build a really large LSTM model, which is called GNMT, the Google Neural Machine Translation system. And indeed it was a huge improvement. It got to 25 BLEU; later, when we added mixtures of experts, it even got to 26. So they were amazed. It launched in production, and it was about a two-year effort to take the papers, scale them up, and launch it. And to get these really good results, you really needed a large network.

As an example of why this was important for Google: you have a sentence in German here, which says, "problems can never be solved with the same way of thinking that caused them." The neural translator translates the sentence kind of the way it should -- I doubt there is a much better translation -- while the phrase-based translator, you can see, gives "no problem can be solved from the same consciousness that they have arisen." It kind of shows how the phrase-based method works. Every word or phrase is translated correctly, but the whole thing does not exactly add up. You can see it's done in a very machine-like way, and it's not so clear what it's supposed to say. The big advantage of neural networks is that they train on whole sentences -- they can even train on paragraphs -- so they can be very fluent. Since they take the whole context into account at once, it's a really big improvement. And if you ask people to score translations, this really starts getting close -- it covers at least 80% of the distance to what human translators do, at least on newspaper language, not poetry. [CHUCKLING] We're nowhere near that.

So it was great. We got the high BLEU scores. We reduced the distance to human translators. It turned out that one system can handle different languages, and sometimes even multilingual translation. But there were problems. One problem was the training time. It took about a week on a setup of 64 to 128 GPUs, and all the code for that was written specifically for this hardware setup. It was distributed training where everything in the machine learning pipeline was tuned for the hardware.
Well, because we knew we would train in this data center on this hardware, so why not? The problem is that batch sizes and learning rates come together; you cannot tune them separately. And then you add tricks. You tweak things in the model that are really good for this specific setup, for this specific learning rate or batch size. This distributed setup was training asynchronously, so there were delayed gradients, which act as a regularizer, so you decrease dropout. You start building parts of the model specifically for one hardware setup. And then you write the paper. We did write a paper. It was cited. But nobody outside of Google ever managed to reproduce it and get the same result with the same network, because we can give you our hyperparameters, but you're running on a different hardware setup -- you will not get the same result. And then, in addition to the machine learning setup, there is the whole tokenization and data preparation pipeline. And even though these results are on public data, the preprocessing is also partially Google-internal. It doesn't matter much, but it really did not allow other people to build on top of this work.

So it launched, and it was a success for us, but in the research sense, we felt that it fell a little short. For one, you'd need a huge hardware setup to train it. And on the other hand, even if you had the hardware setup, or if you got it on cloud and wanted to invest in it, there would still be no way for you to just do it. And that was the prompt: I thought, OK, we need to make a library for the next time we build a model.

So the LSTMs were the first wave of sequence models with the first great results. But I thought, OK, the next time we build a model, we need to have a library that ensures it works at Google and outside, that makes sure that when you train on one GPU, you get a worse result, but we know what it is -- we can tell you, yes, you're on the same setup, just scale up. And it should work on cloud, so if you want a better result, you can pay for larger hardware. But it should be tested, done, and reproducible outside.

So the Tensor2Tensor library started with the model called the Transformer, which is the next generation of sequence models, based on self-attention layers. We designed this model, and it got even better results: 28.4 BLEU. Now we are on par, in BLEU, with human translators. So this metric is not good anymore; it just means that we need better metrics. But this thing can train in one day on an 8-GPU machine. So you can just get an 8-GPU machine -- it can be your machine, it can be in the cloud -- train, and get the results. And it's not just reproducible in principle. A number of groups have reproduced it, gotten the same results, written follow-up papers, and changed the architecture. It went up to 29, it went up to 30. There are companies that use this code. They launched competitors to Google Translate. Well, that happens. And Google Translate improved again. But in a sense, I feel it's been a larger success in terms of community and research, and it raised the bar for everyone. It raised our quality as well.

So that's how it came to be that we feel it's really important to make things reproducible and open, and to test them on different configurations and different hardware.
Because then we can isolate the parts that are fundamentally good from the parts that are just tweaks that work in one configuration and fail in another. So that's our solution to this annoying research problem. It's a solution that requires a lot of work, and it's based on many layers.

The bottom layer is TensorFlow. And TensorFlow, in the meantime, has also evolved a lot. We have tf.data, the TensorFlow input pipeline API. It wasn't there a year ago either; it's in the newer releases, and it helps you build input pipelines that are performant across different hardware. There are TF Layers and Keras, which are higher-level libraries, so you don't need to write everything in low-level TensorFlow ops; you can write things at a higher level of abstraction. There is the new distribution strategy, which lets you take an estimator and say, OK, train on eight GPUs, train on one GPU, train on a distributed setup, train on a TPU -- you don't need to rewrite the handling for every setup on your own.

But that's just the basics. And then comes the Tensor2Tensor part, which answers questions like: I want a good translation model -- where do I get the data from? It's somewhere on the internet, but where? How do I download it? How do I preprocess it? Which model should I use? Which hyperparameters of the model? What if I want to change the model -- I just want to try my own, but on the same data -- what do I need to change? How can it be done? What if I want to use the same model, but on my own data? I have a translation company, I have some data, and I want to see how that works. What if I want to share? What if I want to share a part? What if I want to share everything? That's what Tensor2Tensor does.

So it's a library that has a lot of data sets -- I think more than 100 by now -- all the standard ones: images with ImageNet, CIFAR, MNIST, image captioning with COCO, translations for a number of languages, pure language modeling data sets, speech-to-text, music, and video data sets, which are a very active area recently. If you're into research, you can probably either find your data set here, or there is a very easy tutorial on how to add it.

And with the data sets come the models. There is the Transformer, as I told you -- that's how it started. But then the standard things: ResNet; more fancy image models like RevNet, Shake-Shake, Xception; sequence models, also a bunch of them: SliceNet, ByteNet, which is a variant of WaveNet, LSTMs; then algorithmic models like Neural GPUs. There's been a bunch of recent papers. So it's a selection of models and data sets, but also the framework around them.

If you want to train a model, there is one way to do it. There are many models -- you need to specify which one. There are many data sets -- you need to specify which one. But there is one training binary, so it's always the same. No two-page README that says run these commands for one model and different commands for another. Same for decoding: you want the outputs of your model on a new data set? One command, t2t-decoder. You want to export it to make a server or a website? One command. And when you want to train locally, you just run the binary. If you want to train on Google Cloud, just give your Cloud project ID. You want to train on a Cloud TPU? Just pass --use_tpu and point it at your TPU instance. You need to tune hyperparameters? There is support for that on Google Cloud; we have ranges, so you just specify the hyperparameter range and tune. You want to train distributed on multiple machines? There is a script for that.
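As a concrete illustration of that "one command" decoding step, here is a minimal sketch driven from Python. The t2t-decoder flag names follow the public Tensor2Tensor binaries; the problem, model, hyperparameter set, and file paths are illustrative assumptions, not values from the talk.

```python
# Sketch: run a trained model on new data with the t2t-decoder binary.
# Assumes tensor2tensor is installed and a checkpoint exists in output_dir.
import subprocess

subprocess.run([
    "t2t-decoder",
    "--data_dir=/tmp/t2t_data",                 # where the preprocessed data/vocab lives
    "--output_dir=/tmp/t2t_train/ende",         # directory with the trained checkpoints
    "--problem=translate_ende_wmt32k",          # same problem the model was trained on
    "--model=transformer",
    "--hparams_set=transformer_base",
    "--decode_from_file=new_sentences.en",      # new inputs, one per line
    "--decode_to_file=translations.de",         # where to write the model's outputs
], check=True)
```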
So Tensor2Tensor is data sets, models, and everything around them that's needed to train them. Now, because of our experience with translation, we decided this project is open by default. And open by default, in a similar way to TensorFlow, means every internal code change we push gets immediately pushed to GitHub, and every PR from GitHub we import internally and merge. So there is just one code base. And since this project is pure Python, there's no magic -- it's the same code at Google and outside. Internally we have dozens of code changes a day, and they get pushed out to GitHub immediately.

And since a lot of Brain researchers use this daily, you get things like this. There was a tweet about research and optimizers, saying there are optimizers like AMSGrad, the latest adaptive learning-rate methods. And then James Bradbury, who was at Facebook at the time, tweeted, well, that's not the latest optimizer. The latest optimizer is in Tensor2Tensor, in code, with a TODO to write the paper. The paper is written now -- it's a very good optimizer, Adafactor. But yes, the code just appears there; the papers come later. It makes no sense to wait. It's an open research community. These ideas sometimes work, sometimes they don't, but that's how we work: we push things out, and then we train and see. Actually, by the time the paper appeared, some people in the open source community had already trained models with it, so we added their results -- they were happy to contribute. It's a very good optimizer that saves a lot of memory.

It's a big collaboration. As I said, this is just one list of names; it should probably be longer by now. It's a collaboration between Google Brain and DeepMind, with researchers from the Czech Republic and Germany contributing on GitHub -- over 100 contributors by now, and over 100,000 downloads. I was surprised, because Ryan got this number for this talk, and I thought, how come there are 100,000 people using this thing? It's for ML researchers. But there they are. And there are a lot of papers that use it. These are just the papers that have already been published and accepted; there is a long pipeline of other papers, and possibly some we don't know about.

So, as I told you, it's a unified framework for models. How does it work? The main script of the whole library is t2t-trainer. It's the one binary where you say which model, which data set, which hyperparameters -- go train. So that's the basic command line: install tensor2tensor and then call t2t-trainer. The problem is the name of the data set, and it also includes all the details like how to preprocess, how to resize images, and so on. The model is the name of the model, and the hyperparameter set is which configuration of the model to use -- which hyperparameters, which learning rates, and so on. And then, of course, you need to specify where to store the data, where to store the model checkpoints, for how many steps to train, and so on. That's the full command.

For example, say you want a summarization model. There is a summarization data set that's been used in academia, from CNN and Daily Mail. You say you want the Transformer, and there is a hyperparameter set that does well on summarization. Or you want image classification -- CIFAR-10 is quite a standard benchmark for papers. You say, I want image CIFAR-10, the Shake-Shake model -- this was state of the art a year ago; this changes quickly -- and the big hyperparameter set, and you go train it.
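For the CIFAR-10 Shake-Shake example just described, a full training command looks roughly like the sketch below, again driven from Python. The flag names are the ones the talk walks through (problem, model, hyperparameter set, data and checkpoint directories, number of steps); the exact registered names and the step count follow the public Tensor2Tensor documentation and should be treated as assumptions.

```python
# Sketch: the "full command" for training Shake-Shake on CIFAR-10.
import subprocess

subprocess.run([
    "t2t-trainer",
    "--generate_data",                      # download and preprocess the data set first
    "--problem=image_cifar10",              # which data set (and its preprocessing)
    "--model=shake_shake",                  # which model
    "--hparams_set=shakeshake_big",         # which hyperparameter configuration
    "--data_dir=/tmp/t2t_data",             # where to store the preprocessed data
    "--output_dir=/tmp/t2t_train/cifar10",  # where to store model checkpoints
    "--train_steps=700000",                 # how long to train (illustrative choice)
], check=True)
```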
And the important thing is, we know this result. This gives less than 3% error on CIFAR, which, as I said, was state of the art a year ago -- now it's down to 2%. But we can be certain that when you run this command for the specified number of training steps, you will actually get this result, because internally we run regression tests that start this every day and tell us if it fails. So the usefulness of this framework is not just that we have grouped everything into one command. Because it's automated, we can start testing it. If there is a new change in TensorFlow that breaks some kernel and it doesn't show up in the unit tests, it often shows up in the regression tests of these models. We found at least three bugs in the two recent versions of TensorFlow this way, because some things in machine learning only appear as: things still run, things still train, but they give you 2% less. These are very tricky bugs to find, but if you know on which day it started failing, it's much easier.

Translation, as I said, started with the Transformer. We added more changes, and nowadays it trains to over 29 BLEU. It's a very good translation model. Just run this command on an 8-GPU machine, wait, and you will get a really good translator. Speech recognition: there is the open LibriSpeech data set, and the Transformer model, without any language model, gets a really good word error rate. Some more fancy things: if you want to generate images -- it's recently popular -- there are models that just generate faces or landscapes for you; there are different data sets. This is a model that you train just on CIFAR-10 reversed. To every data set in Tensor2Tensor you can add this _rev suffix; it reverses inputs and targets, and generative models can take that and generate images. For translation it's very useful: if you want German-English instead of English-German, you just add _rev, and it reverses the direction of the data set.

So those are the commands. But, for example, with the image transformer, if you tried to train this on a single GPU to get to 2.9 bits per dimension, you'd probably have to wait half a year. So that's not very practical. But that's the point: currently it's a very hard task to build a very good image generative model. One GPU might not be enough for state of the art. So if you want to really push it, you need to train at scale. You need to train multi-GPU; you need to go to TPUs.

This is the command you've seen before. To make it multi-GPU, you just say --worker_gpu=8. This will use eight GPUs on your machine: it makes the batches eight times larger, runs the eight GPUs in parallel, and it trains. Want to train on a Cloud TPU? Pass --use_tpu, and specify the master of the TPU instance that you booked on Cloud. It trains the same. Want to train on a Cloud TPU pod? I guess you've heard today that Google is opening the pods up to the public -- they go up to 256 TPU cores, I think, or maybe up to 512, judging from this command. Just say do it, train, and it will train much faster. How much faster? Well, we've observed nearly linear scaling up to half a pod, and I think around a 10% loss on a full pod. So these translation models can train on a pod for an hour, and you get state-of-the-art performance. This can really make you train very fast. Same for ImageNet. Well, I say an hour -- there's now a competition: can we get down to half an hour, to 18 minutes? I'm not sure how important that is, but it's really fast.
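Here is a hedged sketch of scaling the same trainer invocation up, assuming the flags mentioned in the talk (--worker_gpu for multi-GPU, --use_tpu plus a master address for Cloud TPU). The problem, model, hyperparameter set, paths, and the TPU address are illustrative assumptions.

```python
# Sketch: scaling the same training command from one GPU to 8 GPUs or a Cloud TPU.
import subprocess

base = [
    "t2t-trainer",
    "--problem=translate_ende_wmt32k",
    "--model=transformer",
    "--hparams_set=transformer_big",
    "--data_dir=/tmp/t2t_data",
    "--output_dir=/tmp/t2t_train/ende",
]

# Use all 8 GPUs on the local machine: batches become 8x larger and run in parallel.
subprocess.run(base + ["--worker_gpu=8"], check=True)

# Or train on a Cloud TPU by pointing the trainer at the TPU instance you booked.
subprocess.run(base + ["--use_tpu=True", "--master=grpc://10.0.0.2:8470"], check=True)
```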
Now, maybe you don't just care about training one set of hyperparameters. Maybe you have your own data set and you need to tune hyperparameters to find a really good model for your application. You tell it to autotune on Cloud ML Engine. You need to say which metric to optimize -- accuracy, perplexity; these are the standard metrics people tune models for -- how many trials to run, and how many of them to run in parallel. And the final line is a range. A range says, well, try learning rates from 0.1 to 0.3, logarithmically or uniformly. These are the things you specify: you can specify continuous things in an interval, and you can specify discrete things, like try two, three, four, or five layers. And the tuner starts the given number of parallel trials -- 20 in this command. The first batch is random, and then it has a quite sophisticated non-differentiable optimizer, Bayesian optimization mixed with CMA-ES, to decide what to try next, and it tries another 20 trials. Usually after 60 or so it starts getting into a good part of the parameter space. So if you need to optimize, that's how you do it. And if you're wondering what range to optimize over, we have a few ranges in the code that we usually start from when we start with new data.

On a TPU pod, what if you want a model that doesn't just do training on large batches, data parallel, but model parallel -- a model with a huge number of parameters, more than one billion? You can use something we call Mesh TensorFlow, which we also started developing in Tensor2Tensor, and which lets you do model parallelism in an easy way. You just say, split my tensor across the cores, however many cores you have, or split it eight ways on this dimension and four ways on that dimension. I'll tell you a bit more about that later. It allows you to train really large models if you want that, and it gives really good results.

So that's how the library works. You can go and use it with the models and data sets that are there. But what if you just want to get the data from a data set, or add your own data set? Well, it's still a Python library, so you can just import it. There is this problem class, which you can use without any other part of the library. You get an instance of the problem class either through the registry -- we have this registry to call things by name, so you can say registry.problem and the name, and you can say problems.available to get all the available names -- or you can instantiate it directly: if you look in the code where the class is, you can just take that class. And then you call generate_data. The problem class knows where on the internet to find the data and how to preprocess it, so generate_data will go to that place, download the data, and preprocess it into TFExample files in the same way that we use it, or that the authors of the data set decided is good for their models. And then you call problem.dataset, which reads it back and gives you the tensors in the form of a tf.data dataset. So that's for data sets. For a model, all our models are subclasses of this T2TModel class, which itself is a Keras layer, so if you want to take one model and plug it together with another one, it's the same as with layers. You get a model -- again either by registry or by class name -- call the model on a dictionary of tensors, and you get the outputs and the losses if you need them.
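A minimal sketch of that library-style usage, based on the description above: looking a problem up by name, generating its data, reading it back as a tf.data dataset, and building a model through the registry. The problem name, hyperparameter set, and directories are illustrative assumptions.

```python
# Sketch: using Tensor2Tensor as a plain Python library (TF 1.x era APIs).
import tensorflow as tf
from tensor2tensor import problems
from tensor2tensor.utils import registry, trainer_lib

print(problems.available())                        # names of all registered data sets

# A problem knows where to download its data and how to preprocess it.
ende = problems.problem("translate_ende_wmt32k")
ende.generate_data("/tmp/t2t_data", "/tmp/t2t_tmp")

# Read it back as a tf.data.Dataset of feature dictionaries ("inputs", "targets").
dataset = ende.dataset(tf.estimator.ModeKeys.TRAIN, data_dir="/tmp/t2t_data")

# Models are registered by name too; a T2TModel behaves like a Keras layer and
# is called on a dictionary of tensors, returning outputs and losses.
hparams = trainer_lib.create_hparams("transformer_base",
                                     data_dir="/tmp/t2t_data",
                                     problem_name="translate_ende_wmt32k")
translate_model = registry.model("transformer")(hparams, tf.estimator.ModeKeys.TRAIN)
```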
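And adding your own data set -- which the next part describes -- can be as small as subclassing one of the helper problem classes and registering it. This is a hedged sketch assuming the Text2TextProblem helper from the public Tensor2Tensor docs; the class name and the data file are hypothetical.

```python
# Sketch: registering a hypothetical in-house text-to-text data set.
from tensor2tensor.data_generators import text_problems
from tensor2tensor.utils import registry


@registry.register_problem
class MyCompanyTranslate(text_problems.Text2TextProblem):
  """Hypothetical parallel corpus stored as tab-separated source/target pairs."""

  @property
  def approx_vocab_size(self):
    return 2**15  # build a ~32k subword vocabulary

  @property
  def is_generate_per_split(self):
    return False  # let Tensor2Tensor split the data into train/dev itself

  def generate_samples(self, data_dir, tmp_dir, dataset_split):
    # Yield raw text pairs; the library handles tokenization, vocabulary
    # building, and writing TFExample files.
    with open("my_parallel_corpus.tsv") as f:
      for line in f:
        source, target = line.rstrip("\n").split("\t")
        yield {"inputs": source, "targets": target}
```

If this lives in your own directory, pointing the trainer at it (the --t2t_usr_dir flag is the documented mechanism) makes the problem available by its snake_case name like any built-in one.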
So you can add your own. You can subclass the base problem class, or, for text-to-text or image-to-class problems, there are subclasses that are easier to subclass: you basically just point to where your data is and convert it from whatever format into this one. And for your own model, you can subclass T2TModel. If you want to share it, it's on GitHub -- make a PR. Under models there is a research subdirectory for models that we don't regression-test; we allow them to be free. If you have an idea and want to share it, put it there. People might come, run it, and tell you it's great.

So yes, Tensor2Tensor is a set of data sets, models, and scripts to run them everywhere. And looking ahead, it's growing. We are happy to have more data sets, and we are happy to have more models. We are ramping up on regression testing and moving models out of research into the more official part, to have them tested and stabilized. On the technical side, we are simplifying the infrastructure. TensorFlow 2 is coming. The code base started more than a year ago and is based on estimators; we are moving it to Keras. We had our own scripts and binaries for running on TPUs and multi-GPU; we are moving to distribution strategy. We are adding exports to TF Hub. So this is a library for training your own models -- the main thing is the trainer -- but once a model is trained and you want to share it pre-trained, TF Hub is the right place, and you can export it with one line. And Mesh TensorFlow allows you to train huge models on cloud pods; I will tell you a little bit more about it in a moment.

On the research side, there's been a lot of research on video models recently. We have a ton of them in Tensor2Tensor, and they're getting better and better. It's a fun thing to generate your own videos. The new thing in machine translation is using back-translation, which is unsupervised: you have a corpus of English and a corpus of German, but no matched pairs. You use a model you already have to generate data and then back-translate, and it shows improvements. And in general, hyperparameter tuning is an important thing in research too, so it's integrated now, and we're doing more and more of it. Reinforcement learning, GANs -- as I said, there are a lot of researchers using it, so there's a lot going on.

One of the things is Mesh TensorFlow. It's a tool for training huge models, meaning really huge: you can have one model that uses a whole TPU pod, 4 terabytes of RAM -- that's how many parameters you can have. It's by Noam, Youlong, Niki, Ashish, and many other people. So what if you want to train an image generation model on high-definition video, or process data that's huge even at batch size 1? You cannot just say, I'll put one example on one core and another example on another core and split it by data; one data example has to go across the whole machine. And then there needs to be a convolution, or a matrix multiplication, that applies to it. So how can we do this and not drown in manually writing, OK, on this core do this, and then slice it back together?

The idea is that every dimension of every tensor you build needs to be named. For example, you name the first dimension "batch", the second is "length", and the third is just the hidden vector. And for every dimension, you specify how it will be laid out on the device. Modern devices are like a 2D mesh of chips, so communication is fast to nearby chips, but not so fast across the mesh. So if it's a grid of chips in hardware, you can say, OK, the batch dimension will be split over the horizontal chips and the length over the vertical ones. We define how to split the tensor over the hardware mesh, and then the operations are already optimized to use the hardware's fast communication and to operate on these tensors as if they were single tensors. So you specify the dimensions by name, you specify their layout, and then you write your model as if it were a single-GPU model. Everything stays simple except for this layout thing, which you need to think about a little bit.
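A hedged illustration of that named-dimension idea, using the open-source mesh_tensorflow package. The dimension sizes, the 2D mesh shape, and the layout string are illustrative assumptions, and building and lowering a full model is omitted.

```python
# Sketch: naming tensor dimensions and mapping them onto a 2D mesh of chips.
import mesh_tensorflow as mtf

graph = mtf.Graph()
mesh = mtf.Mesh(graph, "my_mesh")

# Every tensor dimension gets a name.
batch_dim = mtf.Dimension("batch", 512)
length_dim = mtf.Dimension("length", 256)
hidden_dim = mtf.Dimension("hidden", 4096)

# A named tensor living on the mesh, written as if for a single device.
x = mtf.get_variable(mesh, "x", mtf.Shape([batch_dim, length_dim, hidden_dim]))

# Describe the hardware as a 2D grid of chips ("rows" x "cols") and say which
# named tensor dimension is split along which grid axis; dimensions that are
# not mentioned are replicated.
mesh_shape = mtf.convert_to_shape("rows:8;cols:4")
layout_rules = mtf.convert_to_layout_rules("batch:rows;hidden:cols")
```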
We did a Transformer with it. We did an image transformer. We can train models with billions of parameters on TPU pods with over 50% utilization. The paper for this is also still a TODO; it should be coming out in a few weeks. It's not there yet, but it's a new state of the art on translation and language modeling. It's the next step in really good models, and it also generates nice images. So big models are good -- they give great results -- and this is a way of writing them simply. So yes, that's Mesh TensorFlow. It runs on TPU pods, but it also runs on clusters of GPUs, because we tried not to make the same mistake again of building something that runs on only one kind of hardware.

And with the Tensor2Tensor library, you're welcome to be part of it. Give it a try. Use it. We are on GitHub. There is a chat -- an active lobby for Tensor2Tensor -- where we also try to be every day to help. And yes, that's it. Thank you very much. [APPLAUSE]