[MUSIC PLAYING]

SACHIN JOGLEKAR: Hey. I'm Sachin from the TensorFlow Lite team, and I'm here to talk about delegates. Before I go into the details, I would like to go over some of the basics of what delegation is. Typically, a user would start with a TensorFlow model and use a converter to convert the model into the TFLite format. This TFLite file would then be handed to our interpreter, which runs the model on device. By default, models run on the CPU. So the interpreter would call out to our CPU op kernels, which are highly optimized for the ARM Neon instruction set. However, most devices these days, especially mobile phones, have a lot of other chips, like mobile GPUs or DSPs. And this is where delegates come in. Our Delegate API acts like a bridge between the TensorFlow Lite runtime and lower-level accelerator APIs. For example, our NNAPI delegate acts as an interface between TensorFlow Lite and Android's Neural Networks API, and the GPU delegate uses OpenCL and OpenGL to run inference on mobile GPUs on Android devices.

A natural question here is: why would you use delegates at all? The most obvious benefit is faster inference. The classic example here is the GPU delegate. Because of the highly parallel nature of the GPU, it is very good at performing matrix math, such as convolutions or fully connected layers. As a result, when we use our GPU delegate with TensorFlow Lite, we observe up to 7x speedups with a lot of the vision models that are currently used on mobile devices. Another great benefit is lower power consumption. A good example here is the DSP, or digital signal processor. DSPs are meant for applications such as multimedia and communication, which inherently require low power consumption. So when you use a DSP for inference, it consumes up to 70% less power, which is what we observed when we used our delegate that leverages Qualcomm's Hexagon DSP to run even some of the mobile-optimized models, such as MobileNet or MobileNet SSD.

Now, suppose you have your own secret accelerator, and you want to use our Delegate API to write your own delegate. Let's see how it would work in code. The bulk of how the interpreter delegates nodes is in a function that we like to call DelegatePrepare. This function gets an object called the TfLiteContext, which is essentially an interface into the TensorFlow Lite runtime for the delegate. Using the context, the delegate first gets the execution plan, which is nothing but a list of nodes that are going to be executed in sequence. For each node, the delegate can look at different kinds of information, such as what op it executes or what the types and shapes of its input tensors are. This helps the delegate make an informed decision about which ops it can accept to be delegated. Once this list of supported nodes is populated, the delegate calls a function called ReplaceNodeSubsetsWithDelegateKernels. This function takes two main arguments. One is the list of supported nodes, and the other is what we call the kernel registration. We'll get to that in a minute. But let's look at what the runtime does when the delegate calls this method. Here we have a simple example of a model with Add and Mul ops, and let's say our delegate only supports the Add operation. So once the delegate calls this method with the nodes, the runtime partitions all the nodes into two subsets, or two types of partitions. One is delegated, and the other is non-delegated.
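To make this concrete, here is a rough, self-contained C++ sketch of a custom delegate along these lines: a DelegatePrepare that uses the TfLiteContext callbacks just described (GetExecutionPlan, GetNodeAndRegistration, ReplaceNodeSubsetsWithDelegateKernels), plus the kernel registration whose Init and Invoke methods are explained next. MyAcceleratorProgram, MyAcceleratorSupports, BuildAcceleratorProgram, and RunOnAccelerator are hypothetical stand-ins for vendor-specific code, and header paths may vary slightly between TFLite versions.

```cpp
#include <algorithm>
#include <vector>

#include "tensorflow/lite/builtin_ops.h"
#include "tensorflow/lite/c/common.h"

// Hypothetical vendor-specific pieces: an opaque "compiled partition" object,
// a capability check, and the code that actually talks to the accelerator.
struct MyAcceleratorProgram { /* ... */ };
MyAcceleratorProgram* BuildAcceleratorProgram(TfLiteContext* context,
                                              const TfLiteDelegateParams* params);
TfLiteStatus RunOnAccelerator(TfLiteContext* context, TfLiteNode* node,
                              MyAcceleratorProgram* program);
bool MyAcceleratorSupports(TfLiteBuiltinOperator op);  // e.g. only Add

// Kernel registration: defines how the single delegate op behaves.
TfLiteRegistration GetMyDelegateKernelRegistration() {
  TfLiteRegistration registration{};
  // Init runs at delegation time. For delegate kernels, `buffer` points to
  // TfLiteDelegateParams, which lists the nodes and tensors this kernel is
  // responsible for. Return any opaque, non-null object.
  registration.init = [](TfLiteContext* context, const char* buffer,
                         size_t length) -> void* {
    const auto* params = reinterpret_cast<const TfLiteDelegateParams*>(buffer);
    return BuildAcceleratorProgram(context, params);
  };
  registration.free = [](TfLiteContext* context, void* data) {
    delete static_cast<MyAcceleratorProgram*>(data);
  };
  registration.prepare = [](TfLiteContext* context, TfLiteNode* node) {
    return kTfLiteOk;  // resize/allocate output tensors here if needed
  };
  // Invoke runs at inference time; the object returned by init comes back
  // through node->user_data.
  registration.invoke = [](TfLiteContext* context, TfLiteNode* node) {
    auto* program = static_cast<MyAcceleratorProgram*>(node->user_data);
    return RunOnAccelerator(context, node, program);
  };
  registration.custom_name = "MySecretAcceleratorDelegate";
  return registration;
}

// DelegatePrepare: decides which nodes to claim and hands them back to the
// runtime, as described above.
TfLiteStatus DelegatePrepare(TfLiteContext* context, TfLiteDelegate* delegate) {
  // 1. Get the execution plan: the node indices that will run, in order.
  TfLiteIntArray* plan = nullptr;
  if (context->GetExecutionPlan(context, &plan) != kTfLiteOk) {
    return kTfLiteError;
  }

  // 2. Inspect each node and collect the ones the accelerator supports.
  std::vector<int> supported_nodes;
  for (int i = 0; i < plan->size; ++i) {
    const int node_index = plan->data[i];
    TfLiteNode* node = nullptr;
    TfLiteRegistration* registration = nullptr;
    if (context->GetNodeAndRegistration(context, node_index, &node,
                                        &registration) != kTfLiteOk) {
      return kTfLiteError;
    }
    // A real delegate would also inspect input tensor types and shapes here.
    if (MyAcceleratorSupports(
            static_cast<TfLiteBuiltinOperator>(registration->builtin_code))) {
      supported_nodes.push_back(node_index);
    }
  }

  // 3. Ask the runtime to replace the supported subsets with our kernel.
  TfLiteIntArray* nodes_to_replace =
      TfLiteIntArrayCreate(static_cast<int>(supported_nodes.size()));
  std::copy(supported_nodes.begin(), supported_nodes.end(),
            nodes_to_replace->data);
  const TfLiteStatus status = context->ReplaceNodeSubsetsWithDelegateKernels(
      context, GetMyDelegateKernelRegistration(), nodes_to_replace, delegate);
  TfLiteIntArrayFree(nodes_to_replace);
  return status;
}

// The delegate itself mainly points the runtime at DelegatePrepare.
TfLiteDelegate CreateMyDelegate() {
  TfLiteDelegate delegate = TfLiteDelegateCreate();
  delegate.Prepare = DelegatePrepare;
  return delegate;
}
```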
Now, there are two reasons why this happens. First, many delegates optimize inference by fusing a lot of the ops together. This way, the delegate can maximize the fusions and fuse as many ops as possible. Another great reason is that it's very expensive to go back and forth between the CPU and the accelerator, especially due to memory transfers. Therefore, the fewer the partitions, the more optimized the inference becomes. Once this partitioning is done, the TensorFlow Lite runtime replaces each delegated partition with one single delegate op. At this point, the delegate op behaves just like any other TFLite node as far as the runtime is concerned. And the behavior of this delegate op is defined by the kernel implementation, or the TfLiteRegistration, that we saw in the previous slide.

There are two main methods that need to be implemented for this kernel registration. The first one is Init. This method is run at initialization time, that is, when delegation is happening. Here the delegate kernel gets the delegate params, which is essentially the list of nodes it is responsible for, the associated tensors, and information like that. With this information, the delegate is free to initialize any opaque object it needs. The return type is void*, so the runtime is completely agnostic to what type of object is returned, as long as it is not null. Then, at inference time, we run the method called Invoke. In Invoke, the kernel gets back the object returned during Init, and it is free to do whatever it wants, as long as the implementation is semantically equivalent to what the delegated partition would have done.

Now that we know what happens under the hood, let's look at some of the delegation options in TFLite. The first delegate that we have is the NNAPI delegate, which supports a lot of different accelerators, such as DSPs, GPUs, and NPUs from a variety of vendors. It runs on Android P and above. It supports more than 30 ops on Android P and over 90 ops on Android Q. This is one of the very few delegates that accepts both floating point and integer models. This is how you would typically run inference with the NNAPI delegate using our Java interface. The main idea is that you initialize the delegate instance and you pass it on to our interpreter, and the rest of your business logic remains pretty much the same. There's not much else you have to do for delegates, apart from these couple of lines of initialization and cleanup at the end.

Then we have our GPU delegate, which, as I mentioned before, gives up to a 7x speedup on a lot of the vision models that involve a lot of convolutions and fully connected layers. It uses OpenCL and OpenGL on Android and Metal on iOS. Currently it only accepts floating point models, both 16-bit and 32-bit. We are working to add Vulkan support to the GPU delegate, as well as inference for quantized models, so stay tuned for that. This is how you would do things with the GPU delegate. The thing to note is that, apart from the class name, which is GpuDelegate instead of NnApiDelegate, everything else pretty much remains the same.

We are also excited about the release of the Qualcomm Hexagon DSP delegate, which we announced a couple of weeks back. This delegate provides up to a 25x speedup for quantized uint8 models. Our general recommendation is to use this delegate on Android O and below, or in environments where you may not have NNAPI available, and to use the NNAPI delegate on Android P and above.
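Each of these snippets follows the same pattern of creating a delegate and handing it to the interpreter. For readers not on Java or Swift, here is roughly what that flow looks like with the C++ API, using the GPU delegate as an example. The model path is a placeholder, and the execution-plan comparison at the end is just one informal way to confirm that nodes were actually delegated; it is a sketch, not an official recipe.

```cpp
#include <cstdio>
#include <memory>

#include "tensorflow/lite/delegates/gpu/delegate.h"
#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

int main() {
  // Load a (floating point) model and build an interpreter as usual.
  auto model = tflite::FlatBufferModel::BuildFromFile("model.tflite");
  tflite::ops::builtin::BuiltinOpResolver resolver;
  std::unique_ptr<tflite::Interpreter> interpreter;
  tflite::InterpreterBuilder(*model, resolver)(&interpreter);

  const size_t nodes_before = interpreter->execution_plan().size();

  // Create the GPU delegate and hand it to the interpreter; the rest of the
  // inference logic stays the same.
  TfLiteGpuDelegateOptionsV2 options = TfLiteGpuDelegateOptionsV2Default();
  TfLiteDelegate* gpu_delegate = TfLiteGpuDelegateV2Create(&options);
  if (interpreter->ModifyGraphWithDelegate(gpu_delegate) != kTfLiteOk) {
    std::printf("Delegate could not be applied; running fully on CPU.\n");
  }

  // If the delegate rejected every node, the execution plan is unchanged and
  // there will be no speedup, so it is worth checking.
  std::printf("Nodes before delegation: %zu, after: %zu\n", nodes_before,
              interpreter->execution_plan().size());

  interpreter->AllocateTensors();
  // ... fill input tensors, call interpreter->Invoke(), read output tensors ...

  // Clean up: destroy the interpreter before deleting the delegate.
  interpreter.reset();
  TfLiteGpuDelegateV2Delete(gpu_delegate);
  return 0;
}
```

The same handoff is what the Java and Swift snippets in the talk do under the hood when you add a delegate to the interpreter options.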
We are working with Qualcomm to add support for models that are per-channel quantized, so you can make use of our post-training quantization tooling to run those same models with the Hexagon delegate. Again, the inference code is pretty similar. The only difference is that now you initialize the Hexagon delegate object instead of the GPU delegate object.

We are also excited to announce the Core ML delegate, which uses Apple's Neural Engine to run faster inference on iOS devices. It runs on the A12 SoC and above, and it provides up to a 14x speedup on a lot of the mobile models that are used in on-device inference. It is available on iOS 11 and later. This is how you would run inference with the Core ML delegate using Swift, which is the language of choice for iOS development. The basic idea remains the same: you initialize the object and then you pass it on to an interpreter, with the rest of the logic remaining the same.

Of course, it's not as easy as taking any random model and handing it to any delegate. You have to think about a few questions before you choose a delegate to use with your model. The first consideration is whether the model is supported by the delegate. For example, if you pass a floating point model to the DSP delegate, nothing will happen. Or if you give the GPU delegate a quantized model, for now, it won't run. It won't crash, but a delegate, if given a model it doesn't support, will simply reject all the nodes, and everything will run on the CPU. So there will be no improvement in performance at all.

The second question is about the trade-offs. For example, a lot of the fixed-point delegates, such as the DSP delegate, tend to sacrifice a bit of accuracy to gain a lot of speed, because of reasons like using lower precision or fusing operations together. So if your application requires a lot of precision, this might be a problem for you. Or with the GPU delegate, there is an overhead in RAM usage at initialization time. Also, all of the delegates come with some binary size associated with them, except the NNAPI delegate, which ships with the TFLite runtime by default. So you have to keep an eye on the binary size increase when you use a delegate. All these numbers are provided in the documentation, so be sure to check it out before you apply a delegate.

And the last question, obviously, is whether the delegate actually improves performance. Now, this depends on a lot of different factors, such as supported ops. If there are a lot of unsupported ops in your model, there will be a lot of back and forth between the CPU and the accelerator, which can sometimes result in more latency. So you have to pay attention to which ops are in your model. Another factor is whether the environment supports the delegate. For example, if you give the Core ML delegate a model on an old iPhone, it might not do you any good. But the good news is that we have some tools to help you figure out which delegate to use in any given environment for your model.

We have our favorite benchmark_model tool, which is used for latency profiling on Android devices. You basically build the binary using bazel and push it to the device, and then you can run it to get a lot of statistics about latency performance. This is the kind of output that you get with the tool. It tells you whether it applied the delegate, and then it gives you a bunch of statistics about latency in microseconds. It also sometimes tells you about the CPU memory usage. So if that is important for you, you can check that out.
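Related to the point about paying attention to which ops are in your model: before reaching for the heavier tools, one lightweight option is to walk the interpreter's execution plan after applying a delegate and see which ops were left on the CPU. This is a sketch against the C++ Interpreter API, not a tool shipped with TFLite; delegate kernels and custom ops carry a custom name, while builtin ops report a builtin code that maps to the BuiltinOperator enum in the TFLite schema.

```cpp
#include <cstdio>

#include "tensorflow/lite/interpreter.h"

// Print which ops remain in the execution plan after delegation.
void PrintExecutionPlanOps(const tflite::Interpreter& interpreter) {
  for (int node_index : interpreter.execution_plan()) {
    const auto* node_and_reg = interpreter.node_and_registration(node_index);
    if (node_and_reg == nullptr) continue;
    const TfLiteRegistration& reg = node_and_reg->second;
    if (reg.custom_name != nullptr) {
      // Delegate kernels (and custom ops) carry a custom name.
      std::printf("node %d: %s\n", node_index, reg.custom_name);
    } else {
      // Anything else is a builtin op still running on the CPU.
      std::printf("node %d: builtin op code %d\n", node_index,
                  reg.builtin_code);
    }
  }
}
```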
Then we also have the inference_diff tool, which we released recently and which is basically a way to compare CPU accuracy with the accelerator's accuracy. What it does is run the model in two different environments, one on the CPU and the other on the accelerator, for a bunch of runs with random data, and it compares the output tensors at the end. The result looks something like this, where you get a structured output for each output tensor. So if you know what your output tensors mean in your model, then you have a good idea of how close the accelerator's results are to the CPU's.

And we also have our recently released profilers for Android, which are a great way to dig into how your model behaves on Android devices. Let's take an example. Suppose you're using Perfetto, which is a great tool for Android debugging. You see that delegation is occurring, but you also see that the latency is higher than what you would typically expect. You zoom in and you see that there is a fully connected op running after the delegate, and you know that your delegate only supports one partition, so it cannot delegate that fully connected op. You dig in further, and you see that there is a squeeze op between the delegated partition and the fully connected op, which is causing this problem. If this op were supported on the delegate, the entire thing would run in the same partition. This is a real example of a ResNet with our GPU delegate: if the squeeze op were substituted with a reshape op, the entire model would run on the GPU delegate. So this is how you can use profilers and a tool like Perfetto to figure out why performance might not be what you expect on an Android device.

In the coming months, we are working on better tooling for delegates, to help you figure out how and why performance differs from what you would expect. We're also working on improved performance across all our delegates, and on improved model support, with more ops and more kinds of models, such as floating point and quantized models. And we are also working on revamping our documentation so that you have better support for using and writing your own delegates. That's all. You can look at our documentation at /lite/performance/delegates for all things delegates, the different options, and how to write your own. If you have any questions, feel free to reach out to us at tflite@tensorflow.org. Thank you. [MUSIC PLAYING]