
  • Welcome, everyone.

  • I am Jeremiah, and this is TensorFlow in Production.

  • I'm excited that you're all here because that means you're excited about production.

  • And that means you're building things that people actually use.

  • So our talk today has three parts.

  • I want to start by quickly drawing a thread that connects all of them. The first thread is the origin of these projects.

  • These projects really come from our teams that are on the front lines of machine learning.

  • So these are real problems that we've come across doing machine learning at Google scale, and these are the real solutions that let us do machine learning at Google.

  • The second thing I want to talk about is this observation.

  • If we look at software engineering over the years, we see this growth as we discover new tools.

  • As we discover best practices, we're really getting more effective at doing software engineering, and we're getting more efficient.

  • We're seeing the same kind of growth on the machine learning side, right?

  • We're discovering new best practices and new tools.

  • The catch is that this growth is maybe 10 or 15 years behind software engineering, and we're also rediscovering a lot of the same things that exist in software engineering but in a machine learning context.

  • So we're doing things like discovering version control for machine learning or continuous integration for machine learning.

  • So I think it's worth keeping that in mind as we move through the talks.

  • The first one up is gonna be TensorFlow Hub, and this is something that lets you share reusable pieces of machine learning much the same way we share code.

  • Then we'll talk a little bit about deploying machine learning models with TensorFlow Serving, and we'll finish up with TensorFlow Extended, which wraps a lot of these things together in a platform to increase your velocity as a machine learning practitioner.

  • So with that, I'd like to hand it over to Andrew to talk about TensorFlow Hub.

  • Thanks, Jeremiah.

  • Hi, everybody.

  • I'm Andrew Gasparovic, and I'd like to talk to you a little bit about TensorFlow Hub, which is a new library that's designed to bring reusability to machine learning.

  • So software repositories have been a real benefit to developer productivity over the past 10 or 15 years, and they're great, first of all, because when you're writing something new, if you have a repository, you think, oh, maybe I'll check whether there's something that already exists and reuse that rather than starting from scratch.

  • But a second thing that happens is you start thinking, maybe I'll write my code in a way that's specifically designed for reuse, which is great because it makes your code more modular.

  • But it also has the potential to benefit the whole community if you share that code.

  • What we are doing with TensorFlow Hub is bringing that idea of a repository to machine learning.

  • In this case, TensorFlow Hub is designed so that you can create, share and reuse components of ML models.

  • And if you think about it, it's even more important to have a repository for machine learning, even more so than for software development.

  • Because in the case of machine learning, not only are you reusing the algorithm and the expertise, but you're also reusing the potentially enormous amount of compute power that went into training the model, and a lot of the training data as well.

  • So all four of those, the algorithm, the training data, the compute and the expertise, all go into a module that is shareable on TensorFlow Hub, and then you can import those modules into your own models.

  • Those modules are pre-trained, so they have the weights and the TensorFlow graph inside.

  • And unlike a model, they're designed to be composable, which means that you can put them together like building blocks and add your own stuff on top. They're reusable, which means that they have common signatures so that you can swap one for another. And they're retrainable, which means that you can actually backpropagate through a module that you've inserted into your graph.

  • So let's take a quick look at an example.

  • In this case, we'll do a little bit of image classification and say that we want to make an app to classify rabbit breeds from photos.

  • But we only have a few hundred example photos, probably not enough to build a whole image classifier from scratch.

  • But what we could do is start from a general purpose model, and we could take the reusable part of it.

  • The architecture and weights. We'd take off the classification layers, and then we could add our own classifier on top and train it with our own examples. We'll keep that reusable part fixed, and we'll train our own classifier on top.

  • So if you're using TensorFlow Hub, you start at tensorflow.org/hub, where you can find a whole bunch of newly released, state-of-the-art, research-oriented, and well-known image modules.

  • Some of them include classification, and some of them chop off the classification layers and just output feature vectors.

  • That's what we want in our case, because we're going to add our own classification on top.

  • So maybe we'll choose NASNet, which is an image module that was created by neural architecture search.

  • So we'll choose the NASNet large version with the feature vectors.

  • So we just paste the URL for the module into our TF Hub code.

  • And then we're ready to use that module just like a function. In between, the module gets downloaded and instantiated into your graph.

  • So all you have to do is get those feature vectors, add your own classification on top, and output the new categories.

  • So specifically, what we're doing is training just the classification part while keeping all of the module's weights fixed.
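
A minimal sketch of what that looks like in code, assuming the TF Hub library of the time; the module URL, input image size, and the five-class rabbit head are illustrative rather than taken from the talk:

```python
import tensorflow as tf
import tensorflow_hub as hub

# Pre-trained NASNet feature-vector module; copy the exact URL from its
# page on tfhub.dev (this one is illustrative).
module = hub.Module(
    "https://tfhub.dev/google/imagenet/nasnet_large/feature_vector/1")

# A batch of photos, resized to whatever size the module's docs specify.
images = tf.placeholder(tf.float32, shape=[None, 331, 331, 3])

features = module(images)              # reusable part, weights stay fixed
logits = tf.layers.dense(features, 5)  # our own classifier: 5 rabbit breeds
```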

  • But the great thing about reusing a module is that you get all of the training and compute that's gone into that reusable portion.

  • So in the case of NASNet, it was over 62,000 GPU hours that went into finding the architecture and training the model, plus all of the expertise, the testing, and the research that went into NASNet. You're reusing all of that in that one line of code.

  • And as I mentioned before, those modules are retrainable.

  • So if you have enough data, you can do fine tuning with the module.

  • If you set that trainable parameter to true and you select that you want to use the training graph, what you'll end up doing is training the entire thing along with your classification.

  • The caveat being that, of course, you have to lower the learning rate so that you don't ruin the weights inside the module.

  • But if you have enough training data, it's something that you can do to get even better accuracy.
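
A sketch of that fine-tuning variant, under the same assumptions as the earlier snippet (illustrative URL, sizes, and learning rate):

```python
import tensorflow as tf
import tensorflow_hub as hub

images = tf.placeholder(tf.float32, shape=[None, 331, 331, 3])
labels = tf.placeholder(tf.int64, shape=[None])

# trainable=True plus the "train"-tagged graph lets gradients flow back
# into the module's weights.
module = hub.Module(
    "https://tfhub.dev/google/imagenet/nasnet_large/feature_vector/1",
    trainable=True, tags={"train"})

logits = tf.layers.dense(module(images), 5)
loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

# Use a much lower learning rate than usual so we don't ruin the
# pre-trained weights inside the module.
train_op = tf.train.AdamOptimizer(learning_rate=1e-5).minimize(loss)
```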

  • And in general, we have lots of image modules on TF Hub.

  • We have ones that are straight out of research papers, like NASNet.

  • We have ones that are great for production, even ones made for on-device usage like MobileNet, plus all of the industry standard ones that people are familiar with, like Inception and ResNet.

  • So let's look at one more example, in this case doing a little bit of text classification. We'll look at some restaurant reviews and decide whether they're positive or negative sentiment.

  • And one of the great things about TF Hub is that all of those modules, because they're TensorFlow graphs, can include things like preprocessing.

  • So the text modules that are available on TF Hub take whole sentences and phrases, not just individual words, because they have all of the tokenization and preprocessing stored in the graph itself.

  • So we'll use one of those and basically the same idea.

  • We're going to select a sentence embedding module, we'll add our own classification on top, and we'll train it with our own data.

  • But we'll keep the module itself fixed. And just like before, we'll start by going to tensorflow.org/hub and taking a look at the text modules that are available.

  • In this case, maybe we'll choose the Universal Sentence Encoder, which was just recently released, based on a research paper from last month.

  • The idea is that it was trained on a variety of tasks and is specifically meant to support using it with a variety of tasks.

  • And it also takes just a very small amount of training data to use it in your model, which is perfect for our example case.

  • So we'll use that Universal Sentence Encoder, and just like before, we'll paste the URL into our code.

  • The difference here is we're using it with a text embedding column.

  • That way we can feed it into one of the high-level TensorFlow estimators, in this case the DNNClassifier.
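
A rough sketch of that, assuming the TF Hub feature-column API; the feature key, hidden units, and training call are illustrative:

```python
import tensorflow as tf
import tensorflow_hub as hub

# A feature column backed by the Universal Sentence Encoder module.
review_column = hub.text_embedding_column(
    key="review",
    module_spec="https://tfhub.dev/google/universal-sentence-encoder/1")

# Feed it straight into a high-level estimator.
estimator = tf.estimator.DNNClassifier(
    hidden_units=[64, 16],
    feature_columns=[review_column],
    n_classes=2)  # positive vs. negative sentiment

# train_input_fn would yield ({"review": <batch of strings>}, labels).
# estimator.train(input_fn=train_input_fn, steps=1000)
```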

  • But you could also use that module like I showed in the earlier example, calling it just as a function.

  • If you are using the text embedding column, that also, just like in the other example, can be trained as well.

  • It's something that you can do with a lower learning rate if you have a lot of training data, and it may give you better accuracy. And so we have a lot of text modules available on TF Hub.

  • We actually just added three new languages to the NNLM modules: Chinese, Korean and Indonesian.

  • Those are all trained on Google News training data.

  • And we also have a really great module called ELMo, from some recent research, which understands words in context.

  • And of course, the Universal Sentence Encoder, as I talked about.

  • So just to show you for a minute some of those URLs that we've been looking at, maybe we'll take apart the pieces here.

  • tfhub.dev is our new source for Google and selected partner published modules.

  • In this case, this is Google.

  • That's the publisher.

  • And the universal sentence encoder is the name of the module.

  • The one at the end is a version number.
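
Pulling those pieces apart on the URL itself:

```
https://tfhub.dev / google / universal-sentence-encoder / 1
   (module host)  (publisher)      (module name)       (version)
```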

  • So TensorFlow Hub considers modules to be immutable.

  • And so the version number is there so that if you're, you know, doing one training run and then another, you don't have a situation where the module changes unexpectedly.

  • So all modules on tfhub.dev are versioned that way. And one of the nice things about those URLs is that if you paste them into a browser, you get the module documentation.

  • The idea being that maybe you read a new paper, you see, oh, there's a URL for a TF Hub module, and you paste it into your browser, you see the documentation, you paste it into some code, and in one line you're able to use that module and try out the new research.

  • And speaking of the Universal Sentence Encoder, the team just released a new lite version, which is a much smaller size.

  • It's about 25 megabytes, and it's specifically designed for cases where the full text module wouldn't work, for doing things like on-device classification.

  • Also today, we released a new module from DeepMind; this one, you can feed in video, and it will classify and detect the actions in that video.

  • So in this case, it correctly guesses the video is of people playing cricket.

  • And of course, we also have a number of other interesting modules.

  • There's a generative image module which is trained on CelebA.

  • It has a progressive GAN inside. And also the deep local features module, which can identify the key points of landmark images.

  • Those are all available now on TF Hub.

  • And last but not least, I wanted to mention that we just announced our support for TensorFlow.js.

  • So, using the TensorFlow.js converter, you can directly convert a TF Hub module into a format that can be used on the web.

  • It's a really simple integration to be able to take a module and use it in the web browser with TensorFlow.js, and we're really excited to see what you build with it.
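
A sketch of that conversion step, assuming the tensorflowjs_converter command-line tool and its TF Hub input format; the module URL and output directory are illustrative:

```shell
# Convert a TF Hub module into a web-friendly format for TensorFlow.js.
tensorflowjs_converter \
    --input_format=tf_hub \
    'https://tfhub.dev/google/imagenet/mobilenet_v1_100_224/classification/1' \
    ./web_model
```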

  • So just to summarize, TensorFlow Hub is designed to be a starting point for reusable machine learning, and the idea is just like with a software repository.

  • Before you start from scratch, check out what's available on TensorFlow Hub, and you may find that it's better to start with a module and import that into your model rather than starting the task completely from scratch.

  • We have a lot of modules available, and we're adding more all the time, and we're really excited to see what you build.

  • So thanks.

  • Next up is Jeremiah to talk about TF Serving.

  • All right.

  • Thank you, Andrew.

  • So, next, TensorFlow Serving.

  • This is gonna be how we deploy our trained models. So, to get a sense for where this falls in the machine learning process, right?

  • We start with our data.

  • We use TensorFlow to train a model, and the output, our artifact, is these models, right?

  • These are SavedModels; it's a graph representation of the data flow.

  • And once we have those, we want to share them with the world.

  • That's where TensorFlow Serving comes in.

  • It's this big orange box.

  • This is something that takes our models and exposes them to the world through a service so clients can make requests.

  • TensorFlow Serving will take them, run the inference, run the model, come up with an answer and return that in a response.

  • So TensorFlow Serving is actually the libraries and binaries you need to do this production-grade inference over trained TensorFlow models.

  • It's written in C++, supports things like gRPC, and plays nicely with Kubernetes.

  • So to do this well, it has a couple of features.

  • The first and most important is it supports multiple models.

  • So on one TensorFlow model server, you can load multiple models, right?

  • And just like most folks probably wouldn't push a new binary right to production, you don't want to push a new model right to production, either.

  • So having these multiple models in memory lets you be serving one model on production traffic and load a new one, and maybe send it some canary requests, send it some QA requests, make sure everything's all right, and then move the traffic over to that new model.

  • And this supports doing things like reloading: if you have a stream of models you're producing, TensorFlow Serving will transparently load the new ones and unload the old ones.
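
A sketch of what that looks like in practice: the model server can be pointed at a config file listing the models (or versions) it should serve; the names and paths here are made up:

```
# models.config — a text-proto ModelServerConfig listing the models to load.
model_config_list {
  config {
    name: "rabbits"
    base_path: "/models/rabbits"
    model_platform: "tensorflow"
  }
  config {
    name: "rabbits_canary"
    base_path: "/models/rabbits_canary"
    model_platform: "tensorflow"
  }
}
```

The server would then be started with something like `tensorflow_model_server --port=8500 --model_config_file=models.config`, and it picks up new model versions as they appear under those base paths.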

  • We've built in a lot of isolation.

  • If you have a model that's serving a lot of traffic in one thread and it's time to load a new model, we make sure to do that in a separate thread.

  • That way, we don't cause any hiccups in the thread that's serving production traffic.

  • And again, this entire system has been built from the ground up to be very high throughput.

  • Things like selecting those different models based on the name, or selecting different versions, that's very, very efficient. It similarly has some advanced batching; this way we can make use of accelerators.

  • We also see improvements on standard CPUs with this batching, and then lots of other enhancements.

  • Everything from protocol buffer magic to lots more. And this is really what we use inside Google to serve TensorFlow.

  • I think there's over 1500 projects that use it.

  • It serves somewhere in the neighborhood of 10 million QPS, which ends up being about 100 million items predicted per second.

  • And we're also seeing some adoption outside of Google.

  • One of the new things I'd like to share today is distributed serving.

  • So looking inside Google, we've seen a couple of trends.

  • One is that models are getting bigger and bigger.

  • Some of the ones inside Google are over a terabyte in size.

  • The other thing we're seeing is this sharing of subgraphs, right?

  • TF Hub is producing these common pieces of models.

  • We're also seeing more and more specialization in these models as they get bigger and bigger, right?

  • If you look at some of these model structures, they look less like a model that would belong on one machine and more like an entire system.

  • So this is exactly what distributed serving is meant for, and it lets us take a single model and basically break it up into microservices. To get a better feel for that, we'll see that Andrew has taken his rabbit classifier and is serving it on a model server, and we'll say that I want to create a similar system to classify cat breeds.

  • And so I've done the same thing.

  • I've started from TensorFlow Hub, so you can see I've got the TensorFlow Hub module in the center there, and you'll notice that since we both started from the same module, we have the same bits of code.

  • We have the same core to our machine learning models.

  • So what we can do is start a third server, and we can put the TensorFlow Hub module on that server, and we can remove it from the servers on the outside and leave in its place this placeholder we call a remote op.

  • You can think of this as a portal.

  • It's kind of a forwarding op that, when we run the inference, forwards at the appropriate point in the processing to the model server.

  • There, the computation is done, and the results get sent back, and the computation continues on.

  • Our classifier is on the outside. So there's a few reasons we might want to do this, right?

  • We can get rid of some duplication; now we only have one model server loading all these weights.

  • We also get the benefit that it can batch requests that are coming from both sides.

  • And also we can set up different configurations.

  • You can imagine we might have this model server just loaded with TPUs, our Tensor Processing Units, so that it can do what are most likely convolutional operations and things like that very efficiently.

  • So another place where we use this is with large sharded models.

  • So if you're familiar with deep learning, there's this technique of embedding things like words or YouTube video IDs as a string of numbers, right?

  • We represent them as this vector of numbers, and if you have a lot of words or you have a lot of YouTube videos, you're gonna have a lot of data, so much that it won't fit on one machine.

  • So we use a system like this to split up those embeddings for the words into these shards, and we can distribute them. And of course, the main model, when it needs something, can reach out, get it and then do the computation.

  • Another example is what we call triggering models.

  • So we'll say we're building a spam detector and we have a full model, which is a very, very powerful spam detector.

  • You know, maybe it looks at the words, understands the context.

  • It's very powerful, but it's very expensive, and we can't afford to run it on every single email message we get.

  • So what we do instead is we put this triggering model in front of it.

  • As you can imagine, there's a lot of cases where we're in a position to very quickly say yes, this is spam, or no, it's not.

  • So, For instance, if we get an email that's from within our own domain, maybe we can just say that's not spam.

  • And the triggering model can quickly return that. If it's something that's difficult, it can go ahead and forward that on to the full model, where it will process it.

  • So a similar concept is this mixture of experts.

  • So in this case, let's say we want to build a system where we're going to classify the breed of either a rabbit or a cat.

  • So what we're gonna do is we're gonna have two models we're gonna call expert models, right?

  • So we have one that's an expert at rabbits and another that's an expert at cats.

  • And so here we're gonna use a gating model that will get a picture of either a rabbit or a cat.

  • And the only thing it's gonna do is decide if it's a rabbit or a cat and forward it on to the appropriate expert, which will process it and send the results back.

  • All right, there's lots of use cases.

  • We're excited to see what people start to build with these remote ops.

  • The next thing I'll quickly mention is a REST API.

  • This was one of the top requests on GitHub, so we're happy to be releasing this soon.

  • This will make it much easier to integrate things with existing services, and it's nice because you don't actually have to choose; on one model server, with one TensorFlow model, you can serve either the RESTful endpoint or the gRPC one.

  • There are three APIs.

  • There are some higher-level ones for classification and regression.

  • There's also a lower-level predict, and this is more of a tensor-in, tensor-out API for the things that don't fit into classify and regress.

  • So looking at this quickly now, you can see the URI. Here we can specify the model, right?

  • This may be like rabbit or cat.

  • We can optionally specify a version, and our verbs are classify, regress, and predict.

  • We have two examples.

  • The first one, you can see, we're asking the iris model to classify something.

  • In here, we aren't giving it a model version, so it'll just use the most recent, or the highest, version automatically.

  • And the bottom example is one where we're using the MNIST model, and we're specifying the version to be 314 and asking it to do a prediction.
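
Roughly, the two request URIs being described look like this (host and port are placeholders):

```
POST http://host:port/v1/models/iris:classify               # no version: the highest version is used
POST http://host:port/v1/models/mnist/versions/314:predict  # pinned to version 314
```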

  • So this lets you easily integrate things, and easily version models and switch between them.

  • I'll quickly mention the API. If you're familiar with the TensorFlow Example format, you know that representing it in JSON is a little bit cumbersome, so you can see it's pretty verbose here.

  • There are some other warts, like needing to encode things base64. Instead, with TensorFlow Serving, the REST API uses a more idiomatic JSON, which is much more pleasant, much more succinct.

  • And here, this last example just kind of pulls it all together, where you can use curl to actually make predictions from the command line.
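
Something along these lines, assuming the default REST port and a made-up model name and feature vector:

```shell
curl -X POST http://localhost:8501/v1/models/iris:predict \
  -d '{"instances": [[5.1, 3.5, 1.4, 0.2]]}'
```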

  • So I encourage you to check out the TensorFlow Serving project.

  • There's lots of great documentation and things like that.

  • And we also welcome contributions, code and discussion ideas on our GitHub project page.

  • So I'd like to finish with James to talk about TensorFlow Extended.

  • All right, thanks.

  • All right.

  • So I'm gonna start with a single non-controversial statement.

  • This has been shown true many times by many people.

  • In short, TFX is our answer to that statement.

  • We'll start with a simple diagram.

  • This core box represents your machine learning code.

  • This is the magic bits of algorithms that actually take the data in and produce reasonable results.

  • The blue boxes represent everything else you need to actually use machine learning reliably and scalably in an actual real production setting.

  • The blue boxes are gonna be where you're spending most of your time.

  • It comprises most of the lines of code.

  • It's also gonna be the source of most of the things that are setting off your pagers in the middle of the night.

  • In our case, if we squint at this just about correctly, the core ML box looks like TensorFlow, and all of the blue boxes together comprise TFX.

  • So we're gonna quickly run through four of the key principles that TFX was built on. First, expressibility. TFX is going to be flexible in three ways.

  • First of all, we're gonna take advantage of the flexibility built into tensorflow.

  • Using it as our trainer means that we can do anything TensorFlow can do at the model level, which means you can have wide models, deep models, supervised models, unsupervised models, tree models, anything that we can whip up together.

  • Second, we're flexible with regard to input data.

  • We can handle images, text, sparse data, multimodal models.

  • You might want to train images and surrounding text or something like videos plus captions.

  • Third, there are multiple ways you might go about actually training a model.

  • If your goal is to build a kitten detector, you may have all of your data up front, and you're going to build one model of sufficiently high quality, and then you're done.

  • In contrast to that, if your goal is to build a viral kitten video detector or a personalized kitten recommender, then you're not gonna have all of your data up front.

  • So typically, you'd train a model, get it into production.

  • And then as data comes in, you'll throw away that model and train a new model and then throw away that model and train a new model.

  • We're actually throwing out some good data along with these models, though, so we can try a warm-starting strategy instead, where we'll continuously train the same model.

  • But as data comes in, we'll warm-start based on the previous state of the model and just add the additional new data.

  • This will result in higher quality models with faster convergence.

  • Next, let's talk about portability. So each of the TFX modules represented by the blue boxes doesn't need to do all of the heavy lifting itself.

  • They're part of an open source ecosystem, which means we can lean on things like tensorflow and take advantage of its native portability.

  • This means we can run locally, we can scale up and run in a cloud environment, we can scale to devices that you're thinking about today and to devices that you might be thinking about tomorrow.

  • A large portion of machine learning is data processing, so we rely on Apache Beam, which is built for this task.

  • And again, we can take advantage of Beam's portability as our own, which means we can use the direct runner locally, where you might be starting out with a small piece of data, building small models to affirm that your approaches are actually correct, and then scale up into the cloud with the Dataflow runner, or utilize something like the Flink runner, or things that are in progress right now, like a Spark runner.

  • We'll see the same story again with Kubernetes, where we can start with Minikube running locally, scale up into the cloud or to clusters that we have for other purposes, and eventually scale to things that don't yet exist but are still in progress. So portability is only part of the scalability story.

  • Traditionally, we've seen two very different roles involved in machine learning.

  • You'll have the data scientists on one side and the production infrastructure engineers on the other side.

  • The differences between these are not just the amounts of data, but the key concerns that each has as they go about their daily business. With TFX, we can specifically target use cases that are in common between the two, as well as things that are specific to each.

  • So this will allow us to have one unified system that can scale up to the cloud and down to smaller environments, and actually unlock collaboration between these two roles.

  • Finally, we believe heavily in interactivity: you're able to get quick iteration on results with responsive tooling and fast debugging.

  • And this interactivity should remain even at scale, with large sets of data or large models.

  • This is a fairly ambitious goal.

  • So where are we now?

  • So today we have open sourced a few key areas of responsibility: we have Transform, Model Analysis, Serving and Facets.

  • Each one of these is useful on its own, but is much more so when used in concert with the others.

  • So let's walk through what this might look like in practice.

  • Our goal here is to take a bunch of data we've accumulated and do something useful for the users of our product.

  • These are the steps you want to take along the way.

  • So let's start at step one with the data.

  • We're gonna pull this up in Facets and use it to actually analyze what features might be useful predictors.

  • We'll look for any anomalies, so outliers in the data or missing features, to try to avoid the classic garbage-in, garbage-out problem, and to try to inform what data we're gonna need to further preprocess before it's useful for our ML training.

  • Which leads into our next step, which is to actually use Transform to transform our features. So Transform will let you do full-pass analysis and transforms of your base data.

  • And it's also very firmly attached to the TF graph itself, which will ensure that you're applying the same transforms in training as in serving.

  • From the code, you can see that we're taking advantage of a few ops built into Transform, and we can do things like scale, generate vocabularies, or bucketize our base data. And this code will look the same regardless of our execution environment. And of course, if you need to define your own operations, you can do so.
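
A minimal sketch of a preprocessing_fn using a few of those built-in ops; the feature names are made up, and the op names follow the tf.Transform releases of the time:

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  """Full-pass transforms applied identically in training and serving."""
  outputs = {}
  # Scale using a mean/stddev computed over the whole dataset.
  outputs['capital_gain_scaled'] = tft.scale_to_z_score(inputs['capital_gain'])
  # Generate a vocabulary over the data and map strings to integer ids.
  outputs['occupation_id'] = tft.compute_and_apply_vocabulary(inputs['occupation'])
  # Bucketize a numeric feature into quantile buckets.
  outputs['age_bucket'] = tft.bucketize(inputs['age'], num_buckets=4)
  return outputs
```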

  • So this puts us at the point where we're strongly suspicious that we have data we can actually use to generate a model.

  • So let's look at doing that.

  • We're gonna use a TensorFlow Estimator, which is a high-level API that will let us quickly define, train and export our model.

  • There is a small set of estimators that are present in core TensorFlow; there are a lot more available, and you can also create your own.

  • We're gonna look ahead to some future steps, and we're gonna purposefully export two graphs into our SavedModel, one specific to serving and one specific to model evaluation. And again from the code, you can see that in this case, we're gonna use a wide and deep model.

  • We're gonna define it, we're gonna train it, and we're gonna do our exports.
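
A condensed sketch of those three steps with a wide-and-deep estimator; the feature columns, hyperparameters, paths, and toy input functions are all illustrative:

```python
import tensorflow as tf

breed = tf.feature_column.categorical_column_with_vocabulary_list(
    'breed', ['rabbit', 'cat'])
weight = tf.feature_column.numeric_column('weight_kg')

# Define a wide and deep model.
estimator = tf.estimator.DNNLinearCombinedClassifier(
    linear_feature_columns=[breed],
    dnn_feature_columns=[weight],
    dnn_hidden_units=[128, 64])

def train_input_fn():  # stand-in for a real input pipeline
  features = {'breed': tf.constant(['rabbit', 'cat']),
              'weight_kg': tf.constant([2.1, 4.3])}
  labels = tf.constant([0, 1])
  return tf.data.Dataset.from_tensor_slices((features, labels)).repeat().batch(2)

def serving_input_fn():  # how requests will arrive at serving time
  features = {'breed': tf.placeholder(tf.string, [None]),
              'weight_kg': tf.placeholder(tf.float32, [None])}
  return tf.estimator.export.ServingInputReceiver(features, features)

# Train it, then export a SavedModel for TensorFlow Serving.
estimator.train(input_fn=train_input_fn, steps=1000)
estimator.export_savedmodel('/tmp/export', serving_input_fn)
```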

  • So now we have a model.

  • We could just push this directly to production, but that would probably be a very bad idea.

  • So let's try to get a little more confidence in what would happen if we actually did so, for our end users.

  • So we're gonna step into TF Model Analysis.

  • We're gonna utilize this to evaluate our model over a large dataset, and then we're gonna define, in this case one, but you could possibly use many, slices of this data that we want to analyze independently from the others.

  • This will allow us to actually look at subsets of our data that may be representative of subsets of our users.

  • And how our metrics actually track between these groups.

  • For example, you may have sets of users in different languages, or maybe users with different devices.

  • Or maybe you have a very small but passionate community of rabbit aficionados mixed in with your larger community of kitten fanatics.

  • And you want to make sure that your model will actually give a positive experience to both groups equally.
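
A sketch of that slicing analysis; the argument names follow the early TF Model Analysis releases and may differ in later versions, and the paths and the 'language' slice column are made up:

```python
import tensorflow_model_analysis as tfma

eval_result = tfma.run_model_analysis(
    model_location='/tmp/eval_export/1523000000',   # the eval graph exported earlier
    data_location='/data/eval_examples.tfrecord',
    slice_spec=[
        tfma.SingleSliceSpec(),                      # overall metrics
        tfma.SingleSliceSpec(columns=['language']),  # metrics per language
    ])

# In a notebook, render the metrics broken down by that slice.
tfma.view.render_slicing_metrics(eval_result, slicing_column='language')
```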

  • So now we have a model that we're confident in, and we want to push it to serving.

  • So let's get this up and send some queries at it.

  • So this is quick: now we have a model up, we have a server listening on port 9000 for gRPC requests.

  • Now we're gonna back out into our actual product's code; we can assemble individual prediction requests, and then we can send them out to our server.
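
A sketch of what that client side can look like over gRPC, assuming the tensorflow-serving-api Python package; the model name, input name, and dummy image batch are made up:

```python
import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc

channel = grpc.insecure_channel('localhost:9000')
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

# Assemble an individual prediction request.
request = predict_pb2.PredictRequest()
request.model_spec.name = 'rabbits'
image_batch = np.zeros([1, 331, 331, 3], dtype=np.float32)  # dummy input
request.inputs['images'].CopyFrom(tf.make_tensor_proto(image_batch))

# Send it to the server and read back the result tensors.
response = stub.Predict(request, timeout=5.0)
```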

  • And if this slide doesn't look like your actual code, and this one looks more similar, then you'll be happy to see that this is coming soon. I'm cheating a little by showing you this now as current state.

  • But we're super excited about this, and this is one of those real soon now scenarios.

  • So that's today. What's coming next?

  • So first, please contribute and join the TensorFlow community; we don't want the only time that we're talking back and forth to be at summits and conferences.

  • Secondly, some of you may have seen the TFX paper at KDD last year.

  • This specifies what we believe an end-to-end ML platform actually looks like.

  • Here it is, and by us believing that this is what it looks like, this is what it looks like.

  • This is actually what's powering some of the pretty awesome first-party products that you've been seeing at I/O and that you've probably been using yourselves.

  • But again, this is where we are now.

  • As of right now, this is not the full platform, but you can see what we're aiming for, and we'll get there eventually.

  • So again, please download the software, use it to make good things, and send us feedback. And thank you from all of us for being current and future users, and for choosing to spend your time with us today.
