[MUSIC PLAYING]
SACHIN JOGLEKAR: Hey.
I'm Sachin from the TensorFlow Lite team,
and I'm here to talk about delegates.
Before I go into the details, I would
like to go over some of the basics of what delegation is.
Typically, a user would start with a TensorFlow model
and use a converter to convert the model into the TFLite
format.
This TFLite file would then be handed down
to our interpreter, which runs the model on device.
By default, models run on the CPU.
So the interpreter would call out
to our CPU Op Kernels that are highly optimized for the ARM
Neon instruction set.
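For concreteness, here is a minimal sketch of that default CPU path using the TFLite C++ API. It is not from the talk, and the "model.tflite" path is a placeholder.

```cpp
// Minimal sketch of the default CPU path: load a .tflite file and run it
// with the interpreter's built-in CPU kernels. "model.tflite" is a placeholder.
#include <memory>

#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

int main() {
  auto model = tflite::FlatBufferModel::BuildFromFile("model.tflite");
  tflite::ops::builtin::BuiltinOpResolver resolver;  // registers the CPU op kernels
  std::unique_ptr<tflite::Interpreter> interpreter;
  tflite::InterpreterBuilder(*model, resolver)(&interpreter);

  interpreter->AllocateTensors();  // allocate input/output buffers
  // ... fill interpreter->typed_input_tensor<float>(0) with your input ...
  interpreter->Invoke();           // runs on the CPU by default
  // ... read results from interpreter->typed_output_tensor<float>(0) ...
  return 0;
}
```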
However, most devices these days, especially mobile phones,
have a lot of other chips, like mobile GPUs or DSPs.
And this is where delegates come in.
Our Delegate API acts like a bridge
between the TensorFlow Lite runtime and lower level
accelerated APIs.
For example, our NNAPI delegate acts
as an interface between TensorFlow Lite and Android's
neural network API.
Or the GPU delegate uses OpenCL and OpenGL
to run inference on mobile GPUs on Android devices.
A natural question here is: why would you use delegates at all?
The most obvious benefit is faster inference.
The classic example here is the GPU delegate.
Because of the highly parallelized nature of the GPU,
it is very good at performing matrix
math, such as convolutions or fully connected layers.
As a result, when we use our GPU delegate with TensorFlow Lite,
we observe up to 7x speedups with a lot of the vision models that are currently used on mobile devices.
Another great benefit is lower power consumption.
A good example here is the DSP, or the digital signal
processor.
DSPs are meant for applications such as multimedia
and communication, which inherently require less power
consumption.
So when you use a DSP for inference,
it consumes up to 70% less power,
which is what we observed when we used our delegate that leverages Qualcomm's Hexagon DSP to run even some of the mobile-optimized models, such as MobileNet or MobileNet SSD.
Now, suppose you have your own secret accelerator,
and you want to use our delegate API to write your own delegate.
Let's see how it would work in code.
So the bulk of how the interpreter delegates nodes
is in this function that we like to call DelegatePrepare.
This function gets an object called the TFLite context,
which is essentially an interface into the TensorFlow
Lite runtime for the delegate.
Using the context, the delegate first
gets the execution plan, which is
nothing but a list of nodes that are going
to be executed in sequence.
For each node, the delegate can look
at different kinds of information,
such as what op it executes, or the types and shapes of its input tensors.
This helps the delegate make an informed decision about which ops it can accept for delegation.
Once this list of supported nodes is populated,
the delegate calls a function called
ReplaceNodeSubsetsWithDelegateKernels.
This function takes two main arguments.
One is this list of supported nodes,
and the other is what we call the kernel registration.
We'll get to that in a minute.
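Putting the steps described so far together, here is a hedged C++ sketch of what a DelegatePrepare implementation might look like. MyDelegateSupports and MyDelegateKernelRegistration are hypothetical placeholders for your own logic; the context calls are from the TFLite C API.

```cpp
// Hedged sketch of DelegatePrepare. MyDelegateSupports() and
// MyDelegateKernelRegistration() are hypothetical placeholders.
#include <algorithm>
#include <vector>

#include "tensorflow/lite/c/common.h"

bool MyDelegateSupports(TfLiteContext* context, TfLiteNode* node,
                        TfLiteRegistration* registration);   // your own check
TfLiteRegistration MyDelegateKernelRegistration();           // sketched below

TfLiteStatus DelegatePrepare(TfLiteContext* context, TfLiteDelegate* delegate) {
  // 1. Get the execution plan: the list of node indices, in execution order.
  TfLiteIntArray* plan = nullptr;
  if (context->GetExecutionPlan(context, &plan) != kTfLiteOk) return kTfLiteError;

  // 2. Inspect each node (its op, input tensor types, shapes, ...) and collect
  //    the ones this delegate can accept.
  std::vector<int> supported_nodes;
  for (int i = 0; i < plan->size; ++i) {
    const int node_index = plan->data[i];
    TfLiteNode* node = nullptr;
    TfLiteRegistration* registration = nullptr;
    if (context->GetNodeAndRegistration(context, node_index, &node,
                                        &registration) != kTfLiteOk) {
      return kTfLiteError;
    }
    if (MyDelegateSupports(context, node, registration)) {
      supported_nodes.push_back(node_index);
    }
  }

  // 3. Hand the supported nodes back to the runtime, together with the kernel
  //    registration that defines how the delegate op behaves.
  TfLiteIntArray* supported =
      TfLiteIntArrayCreate(static_cast<int>(supported_nodes.size()));
  std::copy(supported_nodes.begin(), supported_nodes.end(), supported->data);
  const TfLiteStatus status = context->ReplaceNodeSubsetsWithDelegateKernels(
      context, MyDelegateKernelRegistration(), supported, delegate);
  TfLiteIntArrayFree(supported);
  return status;
}
```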
But let's look at what the runtime does when
the delegate calls this method.
Here we have a simple example of a model with Add and Mul ops,
and let's say our delegate only supports the Add operation.
So once the delegate calls this method with the nodes,
the runtime partitions all the nodes into two subsets,
or two types of partitions.
One is delegated, and the other is non-delegated.
Now, there are two reasons why this happens.
The first is that many of the delegates optimize inference by fusing a lot of the ops together.
This way, the delegate can maximize fusion and fuse as many ops as possible.
Another great reason is that it's very expensive
to go back and forth between the CPU and the accelerator,
especially due to memory transfers.
And therefore, the fewer the partitions, the more optimized the inference becomes.
Once this partitioning is done, the TensorFlow Lite runtime
replaces each delegated partition
with one single delegate op.
At this point, the delegate op behaves just
like any other TFLite node for the runtime.
And the behavior of this delegate op is defined by the kernel implementation, or the TfLiteRegistration that we saw in the previous slide.
There are two main methods that need to be implemented for this kernel registration.
The first one is Init.
This method is run at initialization time.
That is when delegation is happening.
Here the delegate, or the delegate kernel, gets the delegate params, which are essentially the nodes that it is responsible for, the corresponding input and output tensors, and information like that.
With this information, the delegate
is free to initialize any opaque object that it can create.
The return type is void*, so the runtime is completely agnostic to what type of object is returned, as long as it is not null.
Then at inference time, we run this method called Invoke.
In Invoke, the kernel gets back the object returned
during Init, and it is free to do whatever it wants to as long
as the implementation is semantically similar to what
the delegated partition would have done.
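Here is a hedged C++ sketch of such a kernel registration, filling in the MyDelegateKernelRegistration placeholder from the earlier sketch. MyDelegateKernel and its Eval body are hypothetical; TfLiteRegistration and TfLiteDelegateParams are the TFLite C API types.

```cpp
// Hedged sketch of the kernel registration's Init/Invoke behavior.
// MyDelegateKernel is hypothetical.
#include "tensorflow/lite/c/common.h"

// Opaque state object created at delegation time and reused at inference time.
class MyDelegateKernel {
 public:
  explicit MyDelegateKernel(const TfLiteDelegateParams* params) {
    // params->nodes_to_replace lists the nodes this kernel is responsible for;
    // params->input_tensors / params->output_tensors describe the partition.
    // Build or compile the accelerator program here.
  }
  TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
    // Run the partition on the accelerator. The result must be semantically
    // equivalent to what the replaced ops would have computed on the CPU.
    return kTfLiteOk;
  }
};

TfLiteRegistration MyDelegateKernelRegistration() {
  TfLiteRegistration reg{};
  // init: called when delegation happens; returns an opaque, non-null object.
  reg.init = [](TfLiteContext* context, const char* buffer, size_t length) -> void* {
    const auto* params = reinterpret_cast<const TfLiteDelegateParams*>(buffer);
    return new MyDelegateKernel(params);
  };
  // free: releases the object created in init.
  reg.free = [](TfLiteContext* context, void* buffer) {
    delete reinterpret_cast<MyDelegateKernel*>(buffer);
  };
  // prepare: resize/allocate output tensors if needed (omitted in this sketch).
  reg.prepare = [](TfLiteContext* context, TfLiteNode* node) { return kTfLiteOk; };
  // invoke: called at inference time with the object returned by init.
  reg.invoke = [](TfLiteContext* context, TfLiteNode* node) {
    return reinterpret_cast<MyDelegateKernel*>(node->user_data)->Eval(context, node);
  };
  reg.custom_name = "MyDelegateKernel";  // shows up in profiling output
  return reg;
}
```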
Now that we know what happens under the hood,
let's look at some of the delegation options in TFLite.
The first delegate that we have is
the NNAPI delegate, which supports
a lot of different accelerators, such as DSPs, GPUs, and NPUs, from a variety of vendors.
It runs on Android P and above.
It supports more than 30 ops on Android P
and over 90 ops on Android Q. This
is one of the very few delegates that accepts both floating
point and integer models.
This is how you would typically run inference
with the NNAPI delegate using our Java interface.
The main idea is that you initialize the delegate instance and pass it on to our interpreter.
And the rest of your business logic
remains pretty much the same.
There's not much else you have to do for delegates, apart
from just these couple of lines of initialization and cleanup
at the end.
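The talk's example uses the Java interface; as a rough counterpart, here is a hedged sketch of the same pattern through the C++ API, assuming the NNAPI delegate header and class name from the TFLite sources and a placeholder model path.

```cpp
// Hedged sketch of the same flow via the C++ API. Only the two delegate-specific
// lines differ from plain CPU inference.
#include <memory>

#include "tensorflow/lite/delegates/nnapi/nnapi_delegate.h"
#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

void RunWithNnapiDelegate() {
  auto model = tflite::FlatBufferModel::BuildFromFile("model.tflite");  // placeholder
  tflite::ops::builtin::BuiltinOpResolver resolver;

  // Delegate-specific: create the NNAPI delegate (declared before the
  // interpreter so it outlives it).
  tflite::StatefulNnApiDelegate nnapi_delegate;

  std::unique_ptr<tflite::Interpreter> interpreter;
  tflite::InterpreterBuilder(*model, resolver)(&interpreter);
  interpreter->ModifyGraphWithDelegate(&nnapi_delegate);

  // The rest of the business logic stays the same.
  interpreter->AllocateTensors();
  interpreter->Invoke();
}
```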
Then we have our GPU delegate, which, as I mentioned before, gives up to 7x speedup on a lot of the vision models
that involve a lot of convolutions
and fully connected layers.
It uses OpenCL and OpenGL on Android and Metal for iOS.
Currently it only accepts floating point models, both 16-bit and 32-bit.
We are working to add Vulkan support to the GPU delegate,
as well as inference for quantized models.
So stay tuned for that.
This is how you would do things with a GPU delegate.
The thing to note is that apart from the class name, which is GpuDelegate instead of NnApiDelegate, everything else pretty much remains the same.
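Likewise, a hedged C++ sketch with the GPU delegate; the TfLiteGpuDelegateV2* functions are assumed to come from tensorflow/lite/delegates/gpu/delegate.h, and only the delegate construction and cleanup differ from the previous sketch.

```cpp
// Hedged sketch: only the delegate construction and cleanup change.
#include <memory>

#include "tensorflow/lite/delegates/gpu/delegate.h"
#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

void RunWithGpuDelegate() {
  auto model = tflite::FlatBufferModel::BuildFromFile("model.tflite");  // placeholder
  tflite::ops::builtin::BuiltinOpResolver resolver;
  std::unique_ptr<tflite::Interpreter> interpreter;
  tflite::InterpreterBuilder(*model, resolver)(&interpreter);

  // Create the GPU delegate with default options and apply it.
  TfLiteGpuDelegateOptionsV2 options = TfLiteGpuDelegateOptionsV2Default();
  TfLiteDelegate* gpu_delegate = TfLiteGpuDelegateV2Create(&options);
  interpreter->ModifyGraphWithDelegate(gpu_delegate);

  interpreter->AllocateTensors();
  interpreter->Invoke();

  // Clean up: destroy the interpreter first, then delete the delegate it used.
  interpreter.reset();
  TfLiteGpuDelegateV2Delete(gpu_delegate);
}
```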
We are excited about the release of the Qualcomm Hexagon DSP delegate, which we announced a couple of weeks back.
This delegate provides up to 25x speedup for quantized uint8 models.
Our general directive is to use this delegate on Android O and below, or in environments where you may not have the Android operating system, and to use the NNAPI delegate on Android P and above.
We are working with Qualcomm to add support
for models which are per-channel quantized.
So you can make use of our post-training quantization tooling to run those same models with the Hexagon delegate.
Again, the inference is pretty similar.
The only difference is that now you initialize the Hexagon delegate object instead of the GPU delegate.
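And a hedged C++ sketch with the Hexagon delegate; the TfLiteHexagon* function names are assumptions recalled from the delegate's header, and the extra global init/teardown calls are the main difference.

```cpp
// Hedged sketch: the Hexagon delegate additionally needs a global init/teardown.
#include <memory>

#include "tensorflow/lite/delegates/hexagon/hexagon_delegate.h"
#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

void RunWithHexagonDelegate() {
  auto model = tflite::FlatBufferModel::BuildFromFile("model.tflite");  // placeholder
  tflite::ops::builtin::BuiltinOpResolver resolver;
  std::unique_ptr<tflite::Interpreter> interpreter;
  tflite::InterpreterBuilder(*model, resolver)(&interpreter);

  // Initialize the Hexagon library once, then create and apply the delegate.
  TfLiteHexagonInit();
  TfLiteHexagonDelegateOptions params = {0};
  TfLiteDelegate* hexagon_delegate = TfLiteHexagonDelegateCreate(&params);
  interpreter->ModifyGraphWithDelegate(hexagon_delegate);

  interpreter->AllocateTensors();
  interpreter->Invoke();

  // Clean up: interpreter first, then the delegate, then the library.
  interpreter.reset();
  TfLiteHexagonDelegateDelete(hexagon_delegate);
  TfLiteHexagonTearDown();
}
```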
We are so excited to announce the Core ML delegate, which
uses Apple's neural engine to run faster inference on iOS
devices.
It runs on the A12 SoC and above,
and it provides up to 14x speedup
on a lot of the mobile models that are
used in on-device inference.
It is available on iOS 11 and later.
This is how you would run the inference with the Core ML
delegate using Swift, which is the language of choice
for iOS development.
The basic idea remains the same, that you initialize the object
and then you pass it on to an interpreter,
with the rest of the logic, apart from the inference,
remaining the same.
Of course, it's not as easy as taking any random model and giving it to any delegate.
You have to think about a few questions
before you choose a delegate to use with your model.
So the first consideration is whether the model
is supported on the delegate.
For example, if you pass a floating point model to the DSP
delegate, nothing will happen.
Or if you give the GPU delegate a quantized model,
for now, it won't run.
It won't crash, but a delegate, if given a model that it doesn't support, will simply reject all the nodes, and everything will run on the CPU.
So there'll be no improvement to performance at all.
The second question is about the trade-offs.
For example, a lot of the fixed-point delegates, such as the DSP delegate, tend to sacrifice a bit of accuracy to gain a lot of speed, for reasons like using lower precision or fusing all of the operations together.
So if your application requires a lot of precision, this might be a problem for you.
Or with the GPU delegate, there is
an overhead in RAM usage during initialization time.
Also, all of the delegates come with some binary size associated with them, except the NNAPI delegate, which ships with the TFLite runtime by default. So you have to keep an eye
on the binary size increase when you use a delegate.
All these numbers are provided in the documentation,
so be sure to check it out before you apply a delegate.
And the last question, obviously,
is whether the delegate actually improves performance.
Now, this depends on a lot of different factors, such as supported ops.
If there are a lot of unsupported ops in your model, there will be a lot of back and forth between the CPU and the accelerator, which can sometimes result in more latency.
So you have to pay attention to which ops are in your model.
Another factor is whether the environment
supports the delegate.
For example, if you give the Core ML delegate
a model on an old iPhone, it might not do you any good.
But the good news is that we have
some tools to help you to figure out
which delegate to use in any given
environment for your model.
We have our favorite benchmark_model tool,
which is used for latency profiling on Android devices.
So you basically build the binary using Bazel, and you push it to the device.
And then you can run it to get a lot of statistics
about latency performance.
This is kind of the output that you get with the tool.
It tells you whether it applied the delegate,
and then it gives you a bunch of statistics
about latency in microseconds.
It also sometimes tells you about the CPU memory usage.
So if that is important for you, you can check that out.
Then we also have the inference_diff tool
that we released recently, which is basically a way to compare CPU performance and accuracy with the accelerator's.
So what it does is it runs the model
in two different environments.
One is a CPU, and the other is the accelerator.
And it does this for a bunch of runs with random data,
and it compares the output tensors at the end.
The result looks something like this,
where you get a structure of output that
is for each output tensor.
So if you know what your output tensors mean in your model,
then you have a good idea of how close the accelerator
performance is to the CPU.
And we also have our recently released profilers for Android, which are a great way to dig into how your model behaves on Android devices.
Let's take an example.
Suppose you're using Perfetto, which
is a great tool for Android debugging.
You see that you have delegation occurring,
but you also see that the latency is higher than what
you would typically expect.
You zoom in and you see that there is a fully connected op
which is running after the delegate,
and you know that your delegate only supports one partition.
So it cannot delegate that fully connected op.
You dig in further, and you see that there
is a squeeze op between the delegate partition and fully
connected, which is causing this problem.
So if this op was supported on the delegate,
then the entire thing would run on the same partition.
And this is a real example of a ResNet with our GPU delegate.
So if this squeeze op was substituted with a reshape op,
the entire thing would run on the GPU delegate.
So this is how you can use profilers and a tool
like Perfetto to figure out why performance might not
be what you expect on an Android device.
In the coming months, we are working on better tooling for delegates, so you can figure out how and why performance is different from what you would expect.
We're also working on improved performance across all our delegates, and on improved model support with more ops and different kinds of models, such as floating point and quantized models.
And we are also working on revamping our documentation so
that you have better support for using and writing
your own delegates.
That's all.
You can look at our documentation
on /lite/performance/delegates for all things delegates,
the different options, and how to write your own.
If you have any questions, feel free to reach out
to us at tflite@tensorflow.org.
Thank you.
[MUSIC PLAYING]