ANDREW SELLE: So today I'm going to talk about TensorFlow Lite,
and I'll give you an intro to what that is.
My name's Andrew Selle.
I'm a software engineer at Google down in Mountain View.
All right, so introduction.
We're going to be talking about machine learning on-device.
So how many of you are interested in running
your machine learning models on-device?
How many already do that?
OK, so a lot.
So that's great.
So you already know that machine learning on-device
is important.
So I'm going to give you a scenario that's
perhaps a little bit out of the ordinary.
Suppose I'm going camping.
I don't have any power.
I don't have any network connectivity.
I'm out in the mountains.
And I want to detect bears.
So I want to hang my cell phone on the outside of my tent
so that if, in the middle of the night,
a bear is coming for me, I'll know.
I don't know what to do if that happens,
but that's a great example of what you do with on-device ML.
So you basically want to have low latency.
You don't want to wait for the bear to already be upon you.
You want some early warning.
You want to make sure it works without a data connection.
The data can stay on the device and you have access
to the sensors.
So there are a lot more practical and important use
cases than that.
But that kind of sets the stage for it.
And so people are, of course, doing a lot more ML on-device.
So that's why we started TensorFlow Lite.
So in short, TensorFlow Lite is our solution
for running on-device machine learning
with low latency and a small binary size,
across many platforms.
So here's an example of what TensorFlow Lite can do.
So let's play our video.
So, suppose we have these objects
and we want to recognize them.
Well, I know what they are, but I want my phone
to be able to do that.
So we use an image classification model.
So here we have a marker.
At the bottom you can see the confidence.
So this is image recognition.
It's happening in real time.
It can recognize ordinary office objects,
but also important objects like TensorFlow logos.
And it also runs on multiple platforms.
So here we have it running on an Android phone and an iOS phone.
But we're not just limited to phones.
We can also do things like this Android Things Toolkit.
And there are going to be a couple more later,
which I'll show you.
So now that we've seen it in action,
let's talk a little bit more about what TensorFlow Lite is.
Do you want to see the video again?
I don't know.
OK, there we go.
The unfortunate thing about on-device,
though, is that it's much harder.
You have tight memory constraints, you need to be low energy,
and you don't have as much computation available
as you do in the cloud.
So TensorFlow Lite has a few properties
that are going to sort of deal with these problems.
So the first one is it needs to be portable.
We run on normal PCs, but we also
run on mobile phones, we run on Raspberry Pi or other Linux
SoCs, IoT-type devices.
And we also want to go to much smaller devices
like microcontrollers.
So, next slide.
The internet doesn't like me today.
Well, we'll skip that slide, whatever that was.
But in any case, basically, this portability
is achieved through the TensorFlow Lite file format.
So once you have a trained TensorFlow model, however you
author it, whether with Swift or with Python,
you produce a saved model in graph form.
The serialized form is then converted to TensorFlow Lite
format, which then is your gateway to running on all
these different platforms.
And we have one more special converter
which allows you to go to Core ML
if you want to target iOS in that particular way.
The second property is that it's optimizable.
We have model compression, we have quantization,
we have CPU kernel fusion.
These are all optimization techniques
to ensure that we have the best performance, as well
as a small size.
And this is achieved through the architecture.
So we have a converter, which we already talked about,
and then we have an interpreter core.
The interpreter core delegates to kernels
that know how to do things, just like in TensorFlow.
But unlike in TensorFlow these are
optimized for mobile and small devices with NEON on ARM.
An additional thing that we have that TensorFlow
doesn't have is the notion of delegation.
So we can delegate to GPUs or Edge TPUs or accelerators.
And this basically gives TensorFlow Lite
the chance to give part of the graph
to a hardware accelerator that can do processing
in a special way.
So one of those we've talked about before is the Android Neural Networks API.
We're excited to announce that in 2018 Q4
we're looking to release our OpenGL-based GPU
delegate, which will give us better performance on GPUs,
and will also accelerate things like MobileNet
and other vision-based models.
So that's really exciting.
In addition, at Cloud Next there was an announcement
about Edge TPUs.
And Edge TPUs are also special as well
because they give us the ability to do
high performance per watt, and also
fit into a small footprint.
So for example, the device itself fits on that penny there,
but here it's on a development board.
So you can put it in many different form factors as well.
Then the third property is that it's parametrically sized.
So we know that TensorFlow Lite needs
to fit on small devices, especially very small devices
like MCUs. And there, you might want to include only
the ops that you need.
So our base interpreter is about 80 kilobytes,
and with all built-in ops it's 750 kilobytes.
We're moving to a world where you can parameterize
what you put into the TensorFlow Lite interpreter,
so you can trade off the ability to handle
new models that use new ops against the ability to only ship what
you need in your application.
So, we introduced TensorFlow Lite last year
and we've been asking users what they think of TensorFlow Lite.
And they really love the cross platform deployment
that they can deploy to iOS and to Android
with the same format.
They like that they can decouple the distribution of the binary
from the distribution of the model.
And they like the inference speed increases.
And they're really excited about the hardware acceleration
roadmap.
But the biggest feedback that we got
is that we should focus on ease of use, we should add more ops,
we should work on model optimization,
and we should provide more documentation.
And so we've listened.
So what I want to do in this talk
is focus on the user experience.
But before we do that, let's look
at some of the users that are already using it.
We have a lot of first party users
and a lot of third party users that are excited about it.
And I hope after this talk you'll be interested as well.
So, the user's experience.
I'm a new user and I want to use TensorFlow Lite.
How do I do it?
Well, I think of it as kind of like learning to swim.
You can think of two things you might do.
You might wade, where you don't really have to swim.
But it's really easy to get started
and you get to cool off.
The second thing is that you can swim
where you can do custom models.
So we're going to talk about both of those.
But before we get into that, there's
an easier thing and then a harder thing.
So the easier thing is to dip your toes, which are demos.
And the harder thing is to just dive full in
and have full kind of mastery of the whole water,
and that would be optimizing models.
And we'll talk about those as well.
So as far as using demo apps, you can go to our website,
download the demos, compile them, and run them.
That'll give you a flavor of what can be done.
I showed you one of those demo apps.
You can try it for yourself.
The next step is to use a pre-trained model.
So the demo app uses a pre-trained model,
and you can use that model in your application.
So if you have something that could benefit from say ImageNet
style classification, you can just take that model
and include it.
Another thing that's really useful is retraining.
So let me show you a retraining workflow, which
is where you take a pre-trained model and customize it.
Great, so we're running this video.
We're showing you the scissors and the Post-It Notes
as before, and here's an application
that I built on PC that allows me to do retraining.
But we're running inference with TensorFlow Lite.
So it knows the scissors, it knows the Post-It Notes,
but what if we got to a really important object, one
that we haven't seen before, perhaps
something everybody has, like a steel TensorFlow logo.
How is it going to do on that?
Well, not well is the unfortunate thing.
But the good thing about machine learning is we can fix it.
We just need some examples.
So here, this app allows us to collect examples.
And you could imagine putting this into a mobile app
where you just move your phone around
and it captures a bunch of examples.
I'm going to do the same thing, except on Linux.
It's a little bit more boring, but it does the job.
So we want to get a lot of examples
to get good generalization.
So once we're happy with that, we'll hit the Train button
and that's going to do some machine learning.
And once it's converged, we're going to have a new model.
It's going to convert it to TensorFlow Lite, which
is going to be really great.
We can test it out.
See if it's now detecting this.
And indeed it is.
That's good.
The other cool thing about that is now
that we have this TF Lite flat buffer model,
we can now run it on our device and it works as well.
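As a rough illustration of the kind of pipeline behind a demo like this (not the actual demo code), here is a minimal transfer-learning sketch. It assumes the Keras applications API and the current tf.lite.TFLiteConverter, and the data, class count, and file names are placeholders.

```python
import numpy as np
import tensorflow as tf

# Placeholder examples standing in for the images collected in the demo.
train_images = np.random.rand(32, 224, 224, 3).astype("float32")
train_labels = np.random.randint(0, 3, size=(32,))  # e.g. scissors, Post-its, logo

# Reuse a pre-trained MobileNet feature extractor and train only a small head.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # keep the pre-trained features frozen

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(train_images, train_labels, epochs=5)

# Convert the retrained model to a TensorFlow Lite flat buffer for the device.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
with open("retrained_model.tflite", "wb") as f:
    f.write(converter.convert())
```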
All right, great.
So, now that we've done pre-trained models,
let's get into full-on swimming.
There are basically four steps that we need to work on.
The first one is building and training
the model, which we've already talked about.
You could do that with Swift, for example;
that would be a great way to do it.
The second one is converting the model.
Third one is validating the model.
And the fourth is deploying the model.
Let's dive into them.
Well, we're not going to dive yet.
We'll swim into them.
OK.
Build and train the model.
So the first thing to do is to get a saved model of your model
and then use the converter.
This can be invoked in Python.
So you could have your training script,
and the last thing you do in it
could be to always convert to TensorFlow Lite.
I in fact recommend that, because that
will allow you to make sure that it's
convertible right from the start.
So you give it the saved model as input,
and you get the TF Lite flat buffer out.
Great.
And then when you're done with that, it will convert.
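To make that concrete, here is a minimal sketch of that conversion step, assuming the tf.lite.TFLiteConverter API (at the time of this talk it lived under tf.contrib.lite) and a hypothetical SavedModel directory.

```python
import tensorflow as tf

# Hypothetical path to the SavedModel exported by the training script.
saved_model_dir = "/tmp/my_model/saved_model"

# Convert the SavedModel to a TensorFlow Lite flat buffer.
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
tflite_model = converter.convert()

# Write the flat buffer out; this file is what ships with the app.
with open("/tmp/my_model/model.tflite", "wb") as f:
    f.write(tflite_model)
```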
Except we don't have all the ops.
So sometimes it won't.
So sometimes you want to visualize the TensorFlow model.
So a lot of models do work, but some of them
are going to have missing ops.
So as we've said, we've listened to your feedback
and to address this we've provided these visualizers
so you can understand your models better.
They're kind of analogous to TensorBoard.
In addition, we've also added 75 built-in ops,
and we're announcing a new feature,
which will allow us to run TensorFlow
kernels in TensorFlow Lite.
So basically, this will allow you
to run normal TensorFlow kernels that we
don't have a built-in op for.
There is a trade-off to that, because it increases
your binary size considerably.
However, it's a great way to get started,
and it will kind of allow you to get into using TensorFlow Lite
and deploy your model if binary size is not
your primary constraint.
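As a hedged sketch of what enabling that looks like with today's converter flags (the exact option names may differ from what shipped at the time of this talk), you tell the converter it may fall back to TensorFlow kernels for ops that have no TensorFlow Lite builtin:

```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("/tmp/my_model/saved_model")

# Allow ops without a TensorFlow Lite builtin to run on TensorFlow kernels,
# at the cost of a considerably larger binary.
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,  # use builtins where available
    tf.lite.OpsSet.SELECT_TF_OPS,    # fall back to TensorFlow kernels otherwise
]

with open("/tmp/my_model/model_with_tf_ops.tflite", "wb") as f:
    f.write(converter.convert())
```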
OK, great.
Once you have your model working and converted,
you will definitely want to validate it.
At every step of machine learning, it's
extremely important to make sure it's still
running the way you think.
So if you have it working in your Python test bench
and it's running, you need to make sure it's also
running in your app.
This is just good practice: checking that, end to end, things
are producing the right answer.
In addition, you might want to do profiling
and you might want to look at what your size is.
Once you've done that, you want to carry this model
over to the next phase, which is optimization.
We're going to talk about that later;
that's what you would do with those results.
OK, so how do you deploy your model? We have several APIs.
The previous time I talked about this,
we had C++ and Java.
Around May or so we introduced a Python API.
And I'm excited to talk about our C API, which
is a way in which we're going to implement
all of our different APIs, similar to how TensorFlow does
it.
In addition, we're also introducing an experimental C#
API, which allows you to use it in a lot of toolkits
that are C#-based.
The most notable of those, which was a feature request,
was Unity.
So you can integrate it into, say, a game.
OK, and then third, Objective-C, to get a more idiomatic,
traditional iOS experience.
Great.
Let me just give you an example of some code here.
So the basic idea is you give it the file
name of the flat buffer model, you fill in the inputs,
and you call invoke.
Then you read out the outputs.
That's how all the APIs work, no matter
what language they're in.
So this was Python.
The same is true in Java.
The same is true in C++.
Perhaps the C one is a little bit more verbose,
but it should be pretty intuitive.
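For reference, here is roughly what that flow looks like in Python, using the tf.lite.Interpreter; the model path and input here are placeholders, and the slide's exact code may differ.

```python
import numpy as np
import tensorflow as tf

# Load the flat buffer model by file name.
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Fill in the inputs (a random tensor of the right shape as a placeholder).
input_data = np.random.rand(*input_details[0]["shape"]).astype(np.float32)
interpreter.set_tensor(input_details[0]["index"], input_data)

# Call invoke, then read out the outputs.
interpreter.invoke()
output_data = interpreter.get_tensor(output_details[0]["index"])
print(output_data)
```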
OK, now that we know how to swim,
let's go into diving into models.
How do we optimize a model?
So once you have your model working,
you might want to get the best performance possible,
and you might want to leverage custom hardware.
This traditionally implies modifying the model.
So the way in which we're going to do this is we're
going to put this into our [INAUDIBLE] loop.
We had our four steps before as part of swimming,
but now we have the additional diving step,
where we do optimization.
And how does optimization work?
We're introducing this thing called the model optimization
toolkit, and the cool thing about it
is it allows you to optimize your model either post
training or during training.
So that means that you can do a lot of things
without retraining the model.
Let me just give an example: right now
we're doing quantization.
So there's two ways to do quantization.
One is, you take your model and look at what ranges it uses
and then just say we're going to quantize this model right now
by just setting those ranges.
So that's called post training quantization.
So all you need to do that is add a flag to the conversion.
I showed you the Python converter before.
There's also a command line one.
But both of them have this option to quantize the weights.
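As a sketch in the Python converter, and assuming the current flag (the option has been renamed across releases; originally it was a simple "quantize weights" switch), post-training quantization is one extra line on the conversion:

```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("/tmp/my_model/saved_model")

# Quantize the weights at conversion time; no retraining required.
converter.optimizations = [tf.lite.Optimize.DEFAULT]

with open("/tmp/my_model/model_quantized.tflite", "wb") as f:
    f.write(converter.convert())
```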
In addition, if you want to do training-time quantization,
we've introduced a toolkit for doing this.
This is now put under the model optimization toolkit,
and it will basically create a new training graph for you
that, when you run it, gives you
the best quantized graph you could get from training.
It kind of takes advantage-- it takes into account that it
is going to be quantized.
So basically, the loss function is aware of the quantization.
So that's good.
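A hedged sketch of that training-time path, assuming the TF 1.x-era contrib quantize rewriter (which I believe is the mechanism being described; the model optimization toolkit may expose it differently), with a toy one-layer model as a placeholder:

```python
import tensorflow as tf

g = tf.Graph()
with g.as_default():
    # Placeholder model: a single dense layer standing in for a real network.
    x = tf.placeholder(tf.float32, [None, 10])
    labels = tf.placeholder(tf.int64, [None])
    logits = tf.layers.dense(x, 2)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

    # Rewrite the graph with fake-quantization ops so the loss "sees"
    # the effects of quantization while training.
    tf.contrib.quantize.create_training_graph(input_graph=g, quant_delay=0)

    # Build the training op against the rewritten, quantization-aware graph.
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
```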
OK.
So, just one more thing that I want to talk about,
which is roadmap.
We're looking actively into things like on-device training,
which will, of course, require us to investigate control flow.
We're adding more techniques to the optimization toolkit.
We'd also like to provide more hardware acceleration support.
And the last thing is for TensorFlow 2.0,
we're moving out of contrib and into TensorFlow.
So we'll be under TensorFlow slash Lite.
OK, so a couple demos.
I wanted to show a couple of models that are using
TensorFlow Lite.
So for example, here's one model that
allows you to see the gaze.
So it's running in real time.
It basically puts boxes around people
and kind of gives a vector of which direction
they're looking.
And this is running in real time on top of TensorFlow
Lite on an Edge TPU.
Let me show you another one.
Oh, sorry.
OK.
It's very tricky.
There we go.
Here's another one that's kind of interesting.
Again, this is using a variant of SSD.
It's basically three autonomous agents,
or rather two autonomous agents and one human-driven agent.
Two of the agents are trying to catch the other one,
and the other one's trying to avoid them.
And they all feed into this SSD model.
Basically, the upshot of this is that it uses an SSD that's
accelerated with Edge TPUs.
It's about 40% faster using Edge TPUs and TF Lite.
And I have one more demo, which is an app that's
using TF Lite called Yummly.
And basically, this is able to give you
recipes based on what it sees.
So let's just see it in action.
So this was originally built on TF Mobile,
but then moved to TF Lite.
So, this is their demo.
So essentially, you point your phone at what's in your fridge
and it will tell you what to make with it.
This is good for me, because I don't
have any creativity in cooking and I have
a lot of random ingredients.
So we're really excited by what people are using TF Lite for.
I want to show you one more demo,
which I just made last week with some of my colleagues.
And it's basically running on a microcontroller.
So this is basically a microcontroller
with a touch screen that has only one
megabyte of flash memory and 340 kilobytes of RAM.
So this is sort of pretty small, and we're
doing speech recognition on it.
So I say, yes.
And it says, yes.
It says no.
It says no.
And if I say some random thing, it says unknown.
So this is pre-recorded.
Unfortunately, I don't have the sound on yet.
But this is just showing that we can run the same interpreter
code on these really small devices.
So we can go all the way to IoT, which I think is super exciting
and will introduce a whole new set of applications
that are possible.
So with that, as I already told you,
we're moving out of contrib.
I'd like you to try out TensorFlow Lite.
Send us some information.
If you're interested in discussing it,
go on to our mailing list, tflite@tensorflow.org.
And we're really excited to hear about new use cases
and to hear feedback.
So thank you.
We're going to-- both of us are going to be at the booth
over in the grand ballroom.
So if you want to talk to us more about either Swift
or TensorFlow Lite, that would be a great time to do it.
And thank you.
[APPLAUSE]