  • LUKASZ KAISER: Hi, my name is Lukasz Kaiser,

  • and I want to tell you in this final session

  • about Tensor2Tensor, which is a library we've

  • built on top of TensorFlow to organize the world's models

  • and data sets.

  • So I want to tell you about the motivation,

  • and how it came together, and what you can do with it.

  • But also if you have any questions

  • in the meantime anytime, just ask.

  • And only if you've already used Tensor2Tensor, in that case,

  • you might have even more questions.

  • But the motivation behind this library

  • is-- so I am a researcher in machine learning.

  • I also worked on production [INAUDIBLE] models,

  • and research can be very annoying.

  • It can be very annoying to researchers,

  • and it's even more annoying to people

  • who put it into production, because the research works

  • like this.

  • You have an idea.

  • You want to try it out.

  • It's machine learning, and you think,

  • well, I will change something in the model, it will be great.

  • It will solve physics problems, or translation, or whatever.

  • So we have this idea, and you're like, it's so simple.

  • I just need to change one tweak, but then, OK, I

  • need to get the data.

  • Where was it?

  • So you search it online, you find it,

  • and it's like, well, so I need to preprocess it.

  • You implement some data reading.

  • You download the model that someone else did.

  • And it doesn't give the result at all

  • that someone else wrote in the paper.

  • It's worse.

  • It works 10 times slower.

  • It doesn't train at all.

  • So then you start tweaking it.

  • Turns out, someone else had this script

  • that preprocessed the data in a certain way that

  • improved the model 10 times.

  • So you add that.

  • Then it turns out your input pipeline is not performant,

  • because it doesn't put data on GPU or CPU or whatever.

  • So you tweak that.

  • Before you start with your research idea,

  • you've spent half a year on just reproducing

  • what's been done before.

  • So then great.

  • Then you do your idea.

  • It works.

  • You write the paper.

  • You submit it.

  • You put it in the repo on GitHub,

  • which has a README file that says,

  • well, I downloaded the data from there,

  • but this link had already gone dead two days

  • after the repo was made.

  • And then, "I applied"-- and you describe

  • all these 17 tweaks,

  • but maybe you forgot one option that was crucial.

  • Well, and then there is the next paper and the next research,

  • and the next person comes and does the same.

  • So it's all great except the production team, at some point,

  • they get like, well, we should put it into production.

  • It's a great result. And then they

  • need to track this whole path, redo all of it,

  • and try to get the same.

  • So it's a very difficult state of the world.

  • And it's even worse because there are different hardware

  • configurations.

  • So maybe something that trained well on a CPU

  • does not train on a GPU, or maybe you need an 8 GPU setup,

  • and so on and so forth.

  • So the idea behind Tensor2Tensor was,

  • let's make a library that has at least a bunch

  • of standard models for standard tasks that includes

  • the data and the preprocessing.

  • So you really can, on a command line, just say,

  • please get me this data set and this model, and train it,

  • and make it so that we can have regression tests and actually

  • know that it will train, and that it will not break with

  • TensorFlow 1.10.

  • And that it will train both on the GPU and on a TPU,

  • and on a CPU--

  • to have it in a more organized fashion.

  • And the thing that prompted Tensor2Tensor,

  • the thing why I started it, was machine translation.

  • So I worked with the Google Translate team

  • on launching neural networks for translation.

  • And this was two years ago, and this was amazing work.

  • Because before that, machine translation

  • was done in this way like--

  • it was called phrase-based machine translation.

  • So if you find some alignments of phrases,

  • then you translate the phrases, and then you

  • try to realign the sentences to make them work.

  • And the results in machine translation

  • are normally measured in terms of something

  • called the BLEU score.

  • I will not go into the details of what it was.

  • It's like the higher the better.

  • So for example, for English-German translation,

  • the BLEU score that human translators get is about 30.

  • And the best phrase-based-- so non-neural network,

  • non-deep-learning-- systems were about 20, 21.

  • And it's been, really, a decade of research at least,

  • maybe more.

  • So when I was doing my PhD, if you got the BLEU score up by one point,

  • you would be a star.

  • It was a good PhD.

  • If you went from 21 to 22, it would be amazing.

  • So then the neural networks came.

  • And the early LSTMs in 2015, they were like 19.5, 20.

  • And we talked to the Translate team,

  • and they were like, you know, guys, it's fun.

  • It's interesting, because it's simpler in a way.

  • You just train the network on the data.

  • You don't have all the--

  • no language-specific stuff.

  • It's a simpler system.

  • But it gets worse results, and who knows

  • if it will ever get better.

  • But then the neural network research moved on,

  • and people started getting 21, 22.

  • So the Translate team, together with Brain, where I work,

  • made the big effort to try to make a really large LSTM

  • model, which is called the GNMT, the Google Neural Machine

  • Translation.

  • And indeed it was a huge improvement.

  • It got to 25 BLEU.

  • Later-- we added mixtures of experts-- it even got to 26.

  • So they were amazed.

  • It launched in production, and well, it

  • was like a two-year effort to take the papers,

  • scale them up, launch it.

  • And to get these really good results,

  • you really needed a large network.

  • So as an example why this is important,

  • or why this was important for Google is--

  • so you have a sentence in German here,

  • which is like, "problems can never

  • be solved with the same way of thinking that caused them."

  • And this neural translator translates the sentence kind

  • of the way it should--

  • I doubt there is a much better translation--

  • while the phrase-based translators, you can see,

  • "no problem can be solved from the same consciousness

  • that they have arisen."

  • It kind of shows how the phrase-based method works.

  • Every word or phrase is translated correctly,

  • but the whole thing does not exactly add up.

  • You can see it's a very machiney way,

  • and it's not so clear what it is supposed to say.

  • So the big advantage of neural networks

  • is they train on whole sentences.

  • They can even train on paragraphs.

  • They can be very fluent.

  • Since they take into account the whole context at once,

  • it's a really big improvement.

  • And if you ask people to score translations,

  • this really starts coming close--

  • or at least 80% of the distance to what human translators do,

  • at least on newspaper language-- not poetry.

  • [CHUCKLING]

  • We're nowhere near that.

  • So it was great.

  • We got the high BLEU scores.

  • We reduced the distance to human translators.

  • It turned out the one system can handle

  • different languages, and sometimes even

  • multilingual translations.

  • But there were problems.

  • So one problem is the training time.

  • It took about a week on a setup of 64 to 128 GPUs.

  • And all the code for that was done specifically

  • for this hardware setup.

  • So it was distributed training where

  • everything in the machine learning pipeline

  • was tuned for the hardware.

  • Well, because we knew we would train in this data

  • center on this hardware.

  • So why not?

  • Well, the problem is batch sizes and learning rates,

  • they come together.

  • You can not tune them separately.

  • And then you add tricks.

  • Then you tweak some things in the model

  • that are really good for this specific setup,

  • for this specific learning rate or batch size.

  • This distributed setup was training asynchronously.

  • So there were delayed gradients.

  • It acts as a regularizer, so you decrease dropout.

  • You start doing parts of the model

  • specifically for a hardware setup.

  • And then you write the paper.

  • We did write a paper.

  • It was cited.

  • But nobody ever outside of Google

  • managed to reproduce this, get the same result

  • with the same network, because we can give you

  • our hyperparameters, but you're running on a different hardware

  • setup.

  • You will not get the same result.

  • And then, in addition to the machine learning setup,

  • there is the whole tokenization pipeline, the data

  • preparation pipeline.

  • And even though these results are on the public data,

  • the whole pre-processing is also partially Google-internal.

  • It doesn't matter much.

  • But it really did not allow other people

  • to build on top of this work.

  • So it launched, it was a success for us,

  • but in the research sense, we felt that it

  • came short a little bit.

  • Because for one, I mean, you'd need a huge hardware setup

  • to train it.

  • And on the other hand, even if you had the hardware setup,

  • or if you got it on cloud and wanted to invest in it,

  • there would still be no way for you to just do it.

  • And that was the prompt, why I thought,

  • OK, we need to make a library for the next time

  • we build a model.

  • So the LSTMs were like the first wave of sequence models

  • with the first great results.

  • But I thought, OK, the next time when we come build a model,

  • we need to have a library that will ensure it works at Google

  • and outside, that will make sure when you train on one GPU,

  • you get a worse result, but we know what it is.

  • We can tell you, yes, you're on the same setup.

  • Just scale up.

  • And it should work on cloud so you can just,

  • if you want better result, get some money,

  • pay for larger hardware.

  • But it should be tested, done, and reproducible outside.

  • And the need-- so the Tensor2Tensor library started

  • with the model called Transformer,

  • which is the next generation of sequence models.

  • It's based on self-attentional layers.

  • And we designed this model.

  • It got even better results.

  • It got 28.4 BLEU.

  • Now we are on par in BLEU with human translators.

  • So this metric is not good anymore.

  • It just means that we need better metrics.

  • But this thing, it can train in one day on an 8 GPU machine.

  • So you can just get it.

  • Get an 8 GPU machine.

  • It can be your machine, it can be in the cloud.

  • Train, get the results.

  • And it's not just reproducible in principle.

  • There's been a number of groups that reproduced it, got

  • the same results, wrote follow-up papers,

  • changed the architecture.

  • It went up to 29, it went up to 30.

  • There are companies that use this code.

  • They launched competition to Google Translate.

  • Well, that happens.

  • And Google Translate improved again.

  • But in a sense, I feel like it's been a larger success in terms

  • of community and research, and it raised the bar for everyone.

  • It raised our quality as well.

  • So that's how it came to be that we

  • feel that it's really important to make things reproducible,

  • open, and test them on different configurations

  • and different hardwares.

  • Because then we can isolate what parts are really

  • good fundamentally from the parts that are just

  • tweaks that work in one configuration

  • and fail in the other.

  • So that's our solution to this annoying research problem.

  • It's a solution that requires a lot of work,

  • and it's based on many layers.

  • So the bottom layer is TensorFlow.

  • And TensorFlow, in the meantime, has also evolved a lot.

  • So we have TF Data, which is the TensorFlow data input pipeline.

  • It was also not there a year ago.

  • It's in the newer releases.

  • It helps to build input pipelines that

  • are performant across different hardware.

  • There is TF Layers and Keras, which

  • are higher-level libraries.

  • So you don't need to write everything

  • in low-level TensorFlow ops.

  • You can write things on a higher level of abstraction.

  • There is the new distribution strategy,

  • which allows you to have an estimator

  • and say, OK, train on eight GPUs,

  • train on one GPU, train on a distributed setup,

  • train on TPU.

  • You don't need to rewrite handlers for everything on your own.
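
A minimal sketch, not from the talk, of what those pieces look like with the TF 1.x-era APIs mentioned above: a tf.data input pipeline plus a DistributionStrategy handed to an Estimator. The file name and model_fn are hypothetical.

```python
import tensorflow as tf  # TF 1.x-era APIs, as discussed in the talk

def input_fn():
    # tf.data builds the input pipeline: read, shuffle, batch, prefetch.
    dataset = tf.data.TFRecordDataset(["train.tfrecord"])  # hypothetical file
    return dataset.shuffle(1024).batch(32).prefetch(1)

# A DistributionStrategy lets the same Estimator train on one GPU, eight GPUs, etc.
strategy = tf.contrib.distribute.MirroredStrategy()  # mirror across local GPUs
run_config = tf.estimator.RunConfig(train_distribute=strategy)
# estimator = tf.estimator.Estimator(model_fn=my_model_fn, config=run_config)
# estimator.train(input_fn, steps=1000)  # my_model_fn is hypothetical
```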

  • But that's just the basics.

  • And then comes the Tensor2Tensor part, which is like, OK,

  • I want a good translation model, where do I get the data from?

  • It's somewhere on the internet, but where?

  • How do I download it?

  • How do I pre-process it?

  • Which model should I use?

  • Which hyperparameters of the model?

  • What if I want to change a model?

  • I just want to try my own, but on the same data.

  • What do I need to change?

  • How can it be done?

  • What if I want to use the same model, but on my own data?

  • I have a translation company.

  • I have some data.

  • I want to check how that works.

  • What if I want to share?

  • What if I want to share a part?

  • What if I want to share everything?

  • That's what Tensor2Tensor does.

  • So it's a library.

  • It's a library that has a lot of data sets--

  • I think it's more than 100 by now--

  • all the standard ones, images, ImageNet, CIFAR, MNIST,

  • image captionings, Coco, translations

  • for a number of languages, just pure language

  • modeling data sets, speech to text, music, video data sets.

  • It's recently very active.

  • If you're into research, you can either probably find it here

  • or there is a very easy tutorial on how to add it.

  • And then with the data sets come the models.

  • There is the Transformer, as I told you--

  • that's how it started.

  • But then the standard things: ResNet, then more fancy image

  • models like RevNet, Shake-Shake, Xception; sequence models,

  • also a bunch of them-- SliceNet, ByteNet,

  • which is a variant of WaveNet--

  • and LSTMs; then algorithmic models like Neural GPUs.

  • There was a bunch of recent papers.

  • So it's a selection of models and data sets,

  • but also the framework.

  • So if you want to train a model, there is one way to do it.

  • There are many models.

  • You need to specify which one.

  • And there are many datasets.

  • You need to specify which one.

  • But there is one training binary.

  • So it's always the same.

  • No two-page README that says, for this model run these commands,

  • and for another one run different commands.

  • Same for decoding.

  • You want to get the outputs of your model

  • on a new data set?

  • One command, t2t-decoder.

  • You want to export it to make a server or a website?

  • One command.
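
As a rough illustration of those two commands, here is a hedged sketch driven from Python via subprocess. The binary and flag names (t2t-decoder, t2t-exporter, --decode_from_file, and so on) follow the T2T documentation of that era; the paths and the problem, model, and hparams-set names are placeholders.

```python
import subprocess

# Decode: run a trained model over a file of new inputs.
subprocess.check_call([
    "t2t-decoder",
    "--data_dir=/tmp/t2t_data", "--output_dir=/tmp/t2t_train/ende",
    "--problem=translate_ende_wmt32k", "--model=transformer",
    "--hparams_set=transformer_base_single_gpu",
    "--decode_hparams=beam_size=4,alpha=0.6",
    "--decode_from_file=inputs.en", "--decode_to_file=outputs.de",
])

# Export: write a SavedModel that a server or website backend can load.
subprocess.check_call([
    "t2t-exporter",
    "--data_dir=/tmp/t2t_data", "--output_dir=/tmp/t2t_train/ende",
    "--problem=translate_ende_wmt32k", "--model=transformer",
    "--hparams_set=transformer_base_single_gpu",
])
```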

  • And then you want to train, train locally,

  • you just run the binary.

  • If you want to train on Google Cloud,

  • just give your cloud project ID.

  • You want to train on a Cloud TPU, just say --use_tpu

  • and give the ID.

  • You need to tune hyperparameters.

  • There is support for it on Google Cloud.

  • We have ranges.

  • Just specify the hyperparameter range and tune.

  • You want to train distributed on multiple machines,

  • there is a script for that.

  • So Tensor2Tensor is data sets, models, and everything

  • around them that's needed to train them.

  • Now, this project, due to our experience with translation,

  • we decided it's open by default. And open

  • by default, in a similar way as TensorFlow,

  • means every internal code change we push gets

  • immediately pushed to GitHub.

  • And every PR from GitHub, we import internally and merge.

  • So there is just one code base.

  • And since this project is pure Python, there's no magic.

  • It's the same code at Google and outside.

  • And it's like internally we have dozens of code changes a day.

  • They get pushed out to GitHub immediately.

  • And since a lot of brain researchers use this daily,

  • there are things like this.

  • So there was a tweet about research and optimizers.

  • And it was like, there are optimizers like AMSGrad,

  • adaptive learning rate methods.

  • And then James Bradbury at Facebook

  • at that time tweeted, well, it's not the latest optimizer.

  • The latest optimizer is in Tensor2Tensor code, with a

  • TODO to write a paper.

  • The paper is written now.

  • It's a very good optimizer-- Adafactor.

  • But yeah, the code, it just appears there.

  • The papers come later.

  • But it makes no sense to wait.

  • I mean, it's an open research community.

  • These ideas, sometimes they work, sometimes they don't.

  • But that's how we work.

  • We push things out.

  • And then we train and see.

  • Actually, by the time the paper appeared,

  • some people in the open source community

  • had already trained models with it.

  • So we added the results.

  • They were happy to.

  • It's a very good optimizer, saves a lot of memory.

  • It's a big collaboration.

  • So as I said, this is just one list of names.

  • It should probably be longer by now.

  • It's a collaboration between Google Brain, DeepMind.

  • Currently there are researchers on GitHub

  • from the Czech Republic and Germany-- over 100 contributors by now,

  • over 100,000 downloads.

  • I was surprised, because Ryan got this number for this talk.

  • And I was like, how come there are 100,000 people using

  • this thing?

  • It's for ML researchers.

  • But whatever, they are.

  • And there are a lot of papers that use it.

  • So these are just the papers that have already

  • been published and accepted.

  • There is a long pipeline of other papers

  • and possibly some we don't know about.

  • So as I told you, it's a unified framework for models.

  • So how does it work?

  • Well, the main script of the whole library is t2t-trainer.

  • It's the one binary where you tell what model, what data set,

  • what hyperparameters, go train.

  • So that's the basic command line-- install tensor2tensor

  • and then call t2t-trainer.

  • The problem flag is the name of the dataset.

  • And it also includes all details like how

  • to pre-process, how to resize images, and so on and so forth.

  • Model is the name of the model and hyperparameter set

  • is which configuration, which hyperparameters

  • of the model, which learning rates, and so on, to use.

  • And then, of course, you need to specify

  • where to store the data, where to store

  • the model checkpoints for how many steps to train and so on.

  • But that's the full command.
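
Put together, the full command looks roughly like this-- a hedged sketch run via subprocess after pip install tensor2tensor. The flag names match the T2T walkthrough of that era; the specific problem, model, and hparams-set names are examples and may differ between versions.

```python
import subprocess

subprocess.check_call([
    "t2t-trainer",
    "--generate_data",                            # download and preprocess the dataset
    "--problem=translate_ende_wmt32k",            # which dataset (and its preprocessing)
    "--model=transformer",                        # which model
    "--hparams_set=transformer_base_single_gpu",  # which hyperparameter configuration
    "--data_dir=/tmp/t2t_data",                   # where to store the data
    "--output_dir=/tmp/t2t_train/ende",           # where to store the checkpoints
    "--train_steps=250000",                       # how many steps to train
])

# The examples that follow in the talk change only the problem/model/hparams
# flags, e.g. (names assumed, may differ between versions):
#   summarization: --problem=summarize_cnn_dailymail32k --model=transformer --hparams_set=transformer_prepend
#   CIFAR-10:      --problem=image_cifar10 --model=shake_shake --hparams_set=shakeshake_big
```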

  • And for example, you want a summarization model.

  • There is a summarization data set

  • that's been used in academia.

  • It's from CNN and Daily Mail.

  • You say you want the transformer,

  • and there is a hyperparameter set that

  • does well on summarization.

  • You want image classification--

  • like CIFAR10 is quite a standard benchmark for papers.

  • You say, I want image CIFAR10.

  • Shake-Shake model-- this was state of the art a year

  • or so ago.

  • This changes quickly.

  • You want the big model, you go train it.

  • And the important thing is we know this result.

  • This gives less than 3% error on CIFAR, which is, as I said,

  • was state of the art a year ago.

  • Now it's down to two.

  • But we can be certain that when you

  • run this command for the specified number of training

  • steps, you will actually get this state of the art,

  • because internally we run regression tests that

  • start this every day and tell us if it fails.

  • So the usefulness of this framework is not just in--

  • well, we have it grouped into one command.

  • But because it's automated, we can start testing it.

  • If there is a new change in TensorFlow that

  • will break some kernel and it doesn't come out

  • in the unit test, it often comes out

  • in the regression tests of these models.

  • And we found at least three bugs in the recent two versions

  • of TensorFlow, because some things in machine learning only

  • appear--

  • like, things still run, things still train,

  • but they give you 2% less.

  • These are very tricky bugs to find,

  • but if you know which day it started failing,

  • it's much easier.

  • Translation, as I said, it started with transformer.

  • We added more changes.

  • Nowadays, it trains to over 29 BLEU.

  • It's a very good translation model.

  • Just run this command on an 8 GPU machine.

  • Wait.

  • You will get a really good translator.

  • Speech recognition, there is the open librispeech data set.

  • Transformer model without any language model

  • gets a really good word error rate.

  • Some more fancy things, like if you want to generate images,

  • it's recently popular, have a model that just generates them--

  • either faces or landscapes, there are different datasets.

  • So this is a model that you train just

  • on CIFAR 10 reversed.

  • To every data set in Tensor2Tensor you

  • can add this _rev suffix. It reverses inputs and targets.

  • And generative models, they can take it and generate it.

  • For translation, it's very useful

  • if you want German-English instead of English-German,

  • you just add _rev. It reverses

  • the direction of the dataset.
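
A small sketch of that _rev convention, assuming the problems helper module; the specific problem names are examples.

```python
from tensor2tensor import problems

# Appending "_rev" to a registered problem name swaps its inputs and targets.
ende = problems.problem("translate_ende_wmt32k")       # English -> German
deen = problems.problem("translate_ende_wmt32k_rev")   # German -> English (reversed)
cifar_rev = problems.problem("image_cifar10_rev")      # class -> image, for generative models
```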

  • So yeah, so they're the commands.

  • But so for example, on an image transformer,

  • if you try to train this on a single GPU

  • to get to this 2.9 bits per dimension,

  • you'd probably have to wait half a year.

  • So that's not very practical.

  • But that's the point.

  • Currently it's a very hard task to do a very good image

  • generative model.

  • One GPU might not be enough for state of the art.

  • So if you want to really push it, you need to train at scale.

  • You need to train multi GPU.

  • You need to go to TPUs.

  • Well, this is the command you've seen before.

  • To make it multi-GPU, you just say --worker_gpu=8.

  • This will use eight GPUs on your machine.

  • It will just make the batches eight times larger,

  • run the eight GPUs in parallel, and there it trains.

  • Want to train on a cloud TPU?

  • Say --use_tpu, and you need to specify the master of the TPU instance

  • that you booked on cloud.

  • It trains the same.
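
The two variants just described look roughly like this; the flag names (--worker_gpu, --use_tpu, --master) are as in T2T at the time, and the TPU address and other values are placeholders.

```python
import subprocess

base = [
    "t2t-trainer",
    "--problem=translate_ende_wmt32k", "--model=transformer",
    "--hparams_set=transformer_big",
    "--data_dir=/tmp/t2t_data", "--output_dir=/tmp/t2t_train/ende",
]

# Same machine, 8 GPUs: batches get 8x larger and run in parallel.
subprocess.check_call(base + ["--worker_gpu=8"])

# Or on a Cloud TPU: point --master at the TPU instance you booked.
subprocess.check_call(base + ["--use_tpu=True", "--master=grpc://10.240.1.2:8470"])
```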

  • Want to train on a cloud TPU pod?

  • I don't know, I guess you've heard today,

  • Google is opening up the pods to the public, which go up to 256,

  • I think, TPU cores.

  • Or maybe up to 512, from what I see in this command.

  • Just say do it.

  • Train.

  • It will train much faster.

  • How much faster?

  • Well, we've observed nearly linear scaling up

  • to half a pod, and I think, like, 10% loss on a full pod.

  • So these models, the translation models,

  • they can train on a pod for an hour,

  • and you get state of the art performance.

  • So this can really make you train very fast.

  • Same for ImageNet.

  • Well, I say an hour, there's now a competition.

  • Can we get down to half an hour, 18 minutes.

  • I'm not sure how important that is, but it's really fast.

  • Now, maybe you don't just care about training

  • one set of hyperparameters.

  • Maybe you have your own data set and you

  • need to tune hyperparameters, find a really good model

  • for your application.

  • Say Cloud ML Engine autotune.

  • You need to say what metric to optimize--

  • so accuracy, perplexity, these are the standard metrics

  • that people tune the models for.

  • Say how many trials, how many of them to run in parallel.

  • And the final line is a range.

  • So a range says, well, try learning rates

  • from 0.1 to 0.3, logarithmically or uniformly.

  • These are the things you specify.

  • So you can specify continuous things in an interval

  • and you can specify discrete things.

  • Just try two, three, four, five layers.

  • And the tuner, it starts the number of parallel trials,

  • so 20 in this command.

  • The first ones are random, and then for the next ones,

  • it has a quite sophisticated non-differentiable optimization

  • model, which is Bayesian optimization mixed with CMA-ES,

  • to decide what to try next. It will try another 20 trials.

  • Usually after, like, 60 or so it starts getting

  • to a good parameter space.

  • So if you need to optimize, that's how you do it.

  • And, like, if you're wondering what range to optimize,

  • we have a few ranges in code that we usually

  • optimize for when we start with new data.
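
A hedged sketch of such a tuning run on Cloud ML Engine follows. The flag names follow the T2T cloud documentation of that era; the objective, trial counts, bucket paths, and the ranged hparams-set name (e.g. transformer_base_range, one of the ranges defined in the code) are examples.

```python
import subprocess

subprocess.check_call([
    "t2t-trainer",
    "--problem=translate_ende_wmt32k", "--model=transformer",
    "--hparams_set=transformer_base",
    "--data_dir=gs://my-bucket/t2t_data",      # hypothetical GCS bucket
    "--output_dir=gs://my-bucket/t2t_train",
    "--cloud_mlengine",                        # run the job on Cloud ML Engine
    "--autotune_objective=metrics-translate_ende_wmt32k/neg_log_perplexity",
    "--autotune_maximize",                     # which metric to optimize, and direction
    "--autotune_max_trials=100",               # how many trials in total
    "--autotune_parallel_trials=20",           # how many to run in parallel
    "--hparams_range=transformer_base_range",  # e.g. learning rate range, 2-6 layers
])
```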

  • On a TPU pod, what if you want a model that doesn't just

  • do training on large batches, data

  • parallel, but model parallel?

  • If you want to have a model with a huge number of parameters,

  • more than one billion, you can use

  • something we call mesh TensorFlow that we also

  • have started developing in tensor2tensor,

  • which allows you to do model parallelism in an easy way.

  • You just say, split my tensor across the cores,

  • however many cores you have.

  • Or split it eight-wise on this dimension and four-wise

  • on this dimension.

  • I'll tell a bit more later about that.

  • It allows you to train really large models if you want this.

  • And that gives really good results.

  • So that's how the library works.

  • You can go and use it with the models and data

  • sets that are there.

  • But what if you want to just get the data from the data set

  • or to add your own data set?

  • Well, it's still a Python library.

  • You can just import it.

  • And there is this problem class, which

  • you can use without any other part of the library.

  • So you can just--

  • you get an instance of the problem class

  • either by-- so we have this registry

  • to call things by name.

  • So you can say registry dot problem and the name.

  • You can say problems dot available to get

  • all the available names.

  • Or you can instantiate it directly.

  • If you look into the code where the class is, you can say,

  • give me this class.

  • And then generate data.

  • The problem class knows where on the internet to find the data

  • and how to pre-process it.

  • So the generate data will go to this place,

  • download it from the internet, and pre-process into TF example

  • files in the same way that we use it

  • or that the authors of this data set

  • decide it is good for their models.

  • And then you call problem.dataset, which reads it

  • from disk and gives you this queue of tensors

  • in the form of a tf.data Dataset.
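
In code, that part of the Problem API looks roughly like this; the method names follow the T2T Problem class of that era, and the directories are placeholders.

```python
import tensorflow as tf
from tensor2tensor import problems

print(problems.available())                           # all registered problem names

problem = problems.problem("translate_ende_wmt32k")   # look a problem up by name

# Download the raw data from where the problem knows to find it, and
# preprocess it into TFExample files on disk.
problem.generate_data("/tmp/t2t_data", "/tmp/t2t_tmp")

# Read the generated files back as a tf.data.Dataset of feature dictionaries.
dataset = problem.dataset(tf.estimator.ModeKeys.TRAIN, data_dir="/tmp/t2t_data")
```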

  • So that's for data sets.

  • For a model, all our models are a subclass

  • of this T2TModel class, which itself is a Keras layer.

  • So if you want to take one model,

  • plug it together with another one, same as with layers.

  • You get a model.

  • You can get it again either by registry or by class name.

  • Call the model on a dictionary of tensors.

  • And you get the outputs and the losses if you need.
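
Roughly, in code-- a hedged sketch; the exact constructor and helper signatures follow the T2T version of that era and may differ.

```python
import tensorflow as tf
from tensor2tensor.utils import registry, trainer_lib

# Look up hyperparameters and the model class by name from the registry.
hparams = trainer_lib.create_hparams("transformer_base",
                                     data_dir="/tmp/t2t_data",
                                     problem_name="translate_ende_wmt32k")
model_cls = registry.model("transformer")
model = model_cls(hparams, tf.estimator.ModeKeys.TRAIN)

# Calling the model on a dictionary of tensors ("inputs", "targets", ...)
# returns the output logits and a dictionary of losses:
# logits, losses = model(features)   # features is your dict of tensors
```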

  • So you can add your own.

  • You can subclass the base problem class,

  • or for text to text or image to class problems,

  • there are subclasses that are easier to subclass.

  • You just basically point where your images are and get them

  • from any format to this.

  • And for your own model, you can subclass t2t model.

  • If you want to share it, it's on GitHub.

  • Make a PR.

  • Under models, there is a research subdirectory

  • where there are models that we don't consider official,

  • that we don't regression test.

  • We allow them to be free.

  • If you have an idea, want to share, put it there.

  • People might come, run it, tell you it's great.

  • So yeah, Tensor2Tensor, it's a set of data sets,

  • models, and scripts to run it everywhere.

  • And yeah, looking ahead, it's growing.

  • So we are happy to have more data sets.

  • We are happy to have more models.

  • We are ramping up on regression testing.

  • We're moving models out of research

  • to the more official part to have them

  • tested and stabilized.

  • On the technical side, we are on to simplifying

  • the infrastructure.

  • So TensorFlow 2 is coming.

  • The code base-- well, it started more than a year ago.

  • It's based on estimators.

  • We are moving it to Keras.

  • We had our own scripts and binaries

  • for running on TPUs and multi-GPUs

  • and we're moving to distribution strategy.

  • We are allowing exports to TF Hub.

  • So this is a library for training your own models.

  • The main thing is the trainer.

  • Once it's trained and you want to share a pre-trained model,

  • TF hub is the right place.

  • You can export it with one line.

  • And Mesh TensorFlow allows you to train huge models

  • on cloud pods.

  • I will tell you a little bit more about it in a moment.

  • On the research side, there's been a lot of research

  • in video models recently.

  • We have a ton of them in Tensor2Tensor.

  • And they're getting better and better.

  • And it's a fun thing to generate your own videos.

  • There is-- the new thing in machine translation

  • is using back translation.

  • So it's unsupervised in a way-- you have a corpus of English

  • and a corpus of German, but no matched pairs.

  • And you use a model to generate data and then back-

  • translate, and it shows improvements.

  • And in general, well, hyperparameter tuning

  • is an important thing in research, too.

  • So it's integrated now, and we're

  • doing more and more of it.

  • Reinforcement learning, GANs-- well, as I said,

  • there are a lot of researchers using it.

  • So there's a lot going on.

  • One of the things, Mesh TensorFlow,

  • it's a tool for training huge models, meaning really huge.

  • Like, you can have one model that uses a whole TPU

  • pod, 4 terabytes of RAM.

  • That's how many parameters you can do.

  • It's by Noam, Youlong, Niki, Ashish, and many people.

  • So what if you want to train an image generation

  • model on high-definition videos, or process data that's

  • huge even at batch size 1?

  • So you cannot just say, oh, I'll do one example on one core,

  • another on another core, just split it by data.

  • One data example has to go on the whole machine.

  • And then there needs to be a convolution that applies to it,

  • or a matrix multiplication.

  • So how can we do this and not drown

  • in writing manually, OK, on this core,

  • do this, and then slice back?

  • So the idea is: for every tensor, every dimension

  • it has needs to be named.

  • For example, you name the first dimension batch.

  • The second is length.

  • And the third is just the hidden vector.

  • And for every dimension, you specify how it

  • will be laid out on a device.

  • So you say, OK, batches--

  • for example, modern devices, they have 2D--

  • they're like a 2D mesh of chips.

  • So the communication is fast to nearby chips,

  • but not so fast across.

  • So you can say if it's a grid of chips in hardware, you can say,

  • OK, the batch dimension will be on the horizontal chips

  • and the length will be on the vertical ones.

  • So we define how to split the tensor on the hardware mesh.

  • And then the operations are already

  • optimized to use the processing of the hardware

  • to do fast communication and operate on these tensors

  • as if they were single tensors.

  • So you specify the dimensions by name.

  • You specify their layout.

  • And then you write your model as if it was a single GPU model.

  • And so everything stays simple except for this layout thing,

  • which you need to think a little bit about.
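
A hedged sketch of that idea with the mesh_tensorflow library; the API names follow the open-source library, while the dimension sizes, mesh shape, and layout are examples.

```python
import mesh_tensorflow as mtf

graph = mtf.Graph()
mesh = mtf.Mesh(graph, "my_mesh")

# Every dimension of every tensor gets a name and a size.
batch_dim = mtf.Dimension("batch", 64)
length_dim = mtf.Dimension("length", 256)
hidden_dim = mtf.Dimension("hidden", 1024)

# Write the computation as if it ran on a single device.
x = mtf.get_variable(mesh, "x", mtf.Shape([batch_dim, length_dim, hidden_dim]))
w = mtf.get_variable(mesh, "w", mtf.Shape([hidden_dim, hidden_dim]))
y = mtf.einsum([x, w], output_shape=mtf.Shape([batch_dim, length_dim, hidden_dim]))

# The layout says which named dimension is split over which axis of the
# 2D hardware mesh: here, batch over rows and hidden over columns.
mesh_shape = mtf.convert_to_shape("rows:4;cols:8")       # e.g. a 4x8 grid of cores
layout_rules = mtf.convert_to_layout_rules("batch:rows;hidden:cols")
```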

  • We did a transformer on it.

  • We did an image transformer.

  • We can train models with 5 billion parameters on TPU pods

  • with over 50% utilization.

  • So this paper, it's also a to do paper,

  • it should be coming out in a few weeks.

  • Not yet there, but it's new state of the art on translation

  • and language modeling.

  • It's the next step in really good models.

  • It also generates nice images.

  • So big models are good.

  • They give great results.

  • And this is a way of writing them simply.

  • So yeah, that's the Mesh TensorFlow.

  • And we try to make it-- so it runs on TPU pods,

  • but it also runs on clusters of GPUs,

  • because we tried to not make the mistake again of doing something

  • that just runs on one hardware setup.

  • And with the Tensor2Tensor library,

  • you're welcome to be part of it.

  • Give it a try.

  • Use it.

  • We are on GitHub.

  • There is a GitHub chat.

  • There is an active lobby for Tensor2Tensor, where we also

  • try to be there every day to help.

  • And yep, that's it.

  • Thank you very much.

  • [APPLAUSE]
