TIM DAVIS: Dance Like enables you to learn how to dance on a mobile phone.

CHRIS MCCLANAHAN: TensorFlow Lite can take our smartphone camera and turn it into a powerful tool for analyzing body poses.

ANDREW SELLE: We had a team at Google that had developed an advanced model for doing pose segmentation. So we were able to take their implementation and convert it into TensorFlow Lite. Once we had it there, we could use it directly.

SHIKHAR AGARWAL: Running all the AI and machine learning models to detect body parts is a very computationally expensive process where we need to use the on-device GPU. The TensorFlow Lite library made it possible for us to leverage all these resources-- the compute on the device-- and give a great user experience.

ANDREW SELLE: Teaching people to dance is just the tip of the iceberg. Anything that involves movement would be a great candidate.

TIM DAVIS: So that means people who have skills can teach other people those skills. And AI is just this layer that interfaces between the two. When you empower people to teach people, I think that's really when you have something that is game changing.

NUPUR GARG: When Tim originally did this, he did it in slow motion. We use these models running on device to speed up his dance performance to match the professional dancer. We also snapshotted a few motions in order to understand which motions he was doing well and which he needed to improve on. Applications like this can be used on device for educational purposes, not only for dance but for other use cases as well.

New cutting-edge models are also pushing the boundaries of what's available on device. BERT is a method of pre-training language representations which obtains state-of-the-art results on a wide array of natural language processing tasks. Today, we're launching MobileBERT, in which BERT has been completely re-architected to be not only smaller but also faster, without losing any accuracy. Running MobileBERT with TensorFlow Lite is 4.4 times faster on the CPU than BERT, and 77% smaller, while maintaining the same accuracy.

Let's take a look at a demo application. This is a question-and-answer demo application that takes snippets from Wikipedia. It has the user ask questions on a particular topic, or it suggests a few preselected questions to ask. It then searches the text corpus for the answers to those questions, all on device. We encourage you to take a look at both of these demo applications at our booth.

So we've worked hard to bring features like Dance Like and MobileBERT to your applications by making it easy to run machine learning models on device. In order to deploy on device, you first need to get a TensorFlow Lite model. Once you have the model, you can load it into your application, transform the data in the way the model requires, run the model, and use the resulting output.

To help you get a model, we've created a rich model repository. We've added many new models that can be used in your applications in production right now. These include basic models such as MobileNet and Inception, as well as MobileBERT, Style Transfer, and DeepLab v3.
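As a rough sketch of that load-transform-run-use flow, here is what it can look like with the Python tf.lite.Interpreter; the model file name and the random input below are placeholder assumptions for illustration, not the code from the talk.

```python
import numpy as np
import tensorflow as tf

# Load a TensorFlow Lite model (file name is a placeholder) and allocate tensors.
interpreter = tf.lite.Interpreter(model_path="mobilenet_v1_1.0_224.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Transform the data into the shape and dtype the model requires.
input_shape = input_details[0]["shape"]                   # e.g. [1, 224, 224, 3]
image = np.random.rand(*input_shape).astype(np.float32)   # stand-in for a real image

# Run the model and use the resulting output.
interpreter.set_tensor(input_details[0]["index"], image)
interpreter.invoke()
probabilities = interpreter.get_tensor(output_details[0]["index"])
print(probabilities.argmax())
```

On Android, the same four steps map onto the Kotlin interpreter and Support Library calls discussed next.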
Once you have your model, you can use our TensorFlow Lite Support Library, which we're also launching this week. It's a new library for processing and transforming data. Right now, it's available on Android for image models, but we're working on adding support for iOS as well as for additional types of models.

The Support Library simplifies the pre-processing and post-processing logic on Android. This includes functions such as rotating an image 90 degrees or cropping it. We're also working on auto-generation APIs that target your specific model and provide simple, model-specific APIs. As I mentioned, the initial launch is focused on image use cases, but we're working on expanding to a broader range of models.

So let's take a look at how this looks in code. Before the Support Library, in order to add TensorFlow Lite to your application, you needed to write all of this code, mostly doing data pre-processing and post-processing. With the auto-generated Support Library APIs, all of this is simplified into five lines of code. The first two lines load the model. Then you can load your image bitmap into the model, and it'll transform the image as required. Next, you run the model, and it outputs a map of string labels to float probabilities. This is how the code will look with the auto-generation APIs that we'll be launching later this year.

One of the biggest frustrations with using models was not knowing the inputs and outputs of the models. Now, model authors can include this metadata with their model so it's available from the start. This is an example of a JSON file that the model author can package into the model. This will be launched with the auto-generation APIs, and all of the models in our model garden will be updated to include this metadata.

In order to make it easy to use all of the models in our model garden and leverage the TF Support Library, we have added example applications for Android and iOS for all of the models, and made the applications use the TF Support Library wherever possible. We're also continuing to build out our demo applications on both the Raspberry Pi and the Edge TPU.

So now, what if your use case isn't covered, either by our model garden or by the Support Library? Revisiting all the use cases, there are a ton of use cases that aren't covered by the specific models we listed. The first thing you need to do is either find a model or build one yourself with the TensorFlow APIs, either the Keras APIs or the Estimator APIs. Once you have a SavedModel, which is the unified file format for 2.0, you can pass it through the TensorFlow Lite converter, and you'll get a TensorFlow Lite FlatBuffer model as output. In code, it's actually very simple. You can generate your model, save it with one line, and use two lines of code to load the same model and convert it. We also have APIs that directly convert Keras models. All the details are available on our website.

Over the last few months, we've worked really hard on improving our converter. We've added a new converter, which has better debuggability, including source file location identification. This means you can see exactly where in your code something cannot be converted to TF Lite. We've also added support for Control Flow v2, which is the default control flow in 2.0. In addition, we're adding new operations as well as support for new models, including Mask R-CNN, Faster R-CNN, MobileBERT, and Deep Speech v2. To enable the new converter, all you have to do is set the experimental new converter flag to true. We encourage everyone to participate in the testing process. We plan to make this new converter the default back end at some point in the future.
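As a minimal sketch of that conversion path, assuming a small hypothetical Keras model rather than the one shown on the slides, the SavedModel route and the new-converter flag look roughly like this in Python:

```python
import tensorflow as tf

# A small stand-in model; any Keras or Estimator model exported as a SavedModel works.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation="softmax", input_shape=(784,)),
])
model.save("saved_model_dir")  # SavedModel, the unified format for TF 2.0

# Convert the SavedModel to a TensorFlow Lite FlatBuffer.
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.experimental_new_converter = True  # opt in to the new converter
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```

There is also tf.lite.TFLiteConverter.from_keras_model for converting a Keras model directly, as mentioned above.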
So let's look at the debuggability of this new converter. When running this model, it gives an error that the TF Reciprocal op is neither a custom op nor a flex op. Then it provides a stack trace, allowing you to understand where in the code this operation is called. That way, you know exactly what line to address.

Once you have your TF Lite model, it can be integrated into your application the same way as before. You have to load the model, pre-process the data, run the model, and use the resulting output. Let's take a look at a version of this code in Kotlin. In the first two lines, you load the model and then pass it to our interpreter. Once you have loaded the model, you need to initialize the input array and the output array. The input should be a byte buffer, and the output array needs to contain all of the probabilities, so it's a plain float array. Then you can run it through the interpreter and do any post-processing as needed.

To summarize these concepts: you have the converter to generate your model and the interpreter to run your model. The interpreter calls into op kernels and delegates, which I'll talk about in detail in a bit. And guess what? You can do all of this in a variety of language bindings. We've released a number of new, first-class language bindings, including Swift and Objective-C for iOS, C# for Unity developers, and C for native developers on any platform. We've also seen the creation of a number of community-owned language bindings for Rust, Go, and Dart.

Now that we've discussed how TensorFlow Lite works at a high level, let's take a closer look under the hood. One of the first hurdles developers face when deploying models on device is performance. We've worked very hard, and we're continuing to work hard, on making this easy out of the box. We've worked on improvements on the CPU, the GPU, and many custom hardware accelerators, as well as adding tooling to make it easy to improve your performance.

This slide shows TF Lite's performance at Google I/O in May. Since then, we've had significant performance improvements across the board, from float models on the CPU to models on the GPU. Just to re-emphasize how fast this is: a float MobileNet v1 model takes 37 milliseconds to run on the CPU. If you quantize that model, it takes only 13 milliseconds on the CPU. On the GPU, the float model takes six milliseconds. And on the Edge TPU, with quantized fixed point, it takes two milliseconds.

Now let's discuss some common techniques to improve model performance. There are five main approaches: quantization, pruning, leveraging hardware accelerators, using mobile-optimized model architectures, and per-op profiling.

The first way to improve performance is to use quantization. Quantization is a technique used to reduce the precision of static parameters, such as weights, and dynamic values, such as activations. For most models, training and inference use float 32. However, in many use cases, using int 8 or float 16 instead of float 32 improves latency without a significant decrease in accuracy. Quantization also enables many hardware accelerators that only support 8-bit computations. In addition, it allows additional acceleration on the GPU, which is able to do two float 16 computations in place of one float 32 computation. We provide a variety of techniques for performing quantization as part of the model optimization toolkit. Many of these techniques can be applied after training for ease of use.
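As a minimal sketch of the post-training quantization options just mentioned, assuming a hypothetical SavedModel directory:

```python
import tensorflow as tf

# Dynamic range quantization: weights become 8-bit integers after training.
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
int8_weights_model = converter.convert()

# Float 16 quantization: halves the model size and pairs well with the GPU delegate.
fp16_converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
fp16_converter.optimizations = [tf.lite.Optimize.DEFAULT]
fp16_converter.target_spec.supported_types = [tf.float16]
fp16_model = fp16_converter.convert()
```

Full integer quantization, which some 8-bit-only accelerators require, additionally takes a representative dataset so that activations can be calibrated.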
The second technique for improving model performance is pruning. During model pruning, we set unnecessary weight values to zero. By doing this, we're able to remove what we believe are unnecessary connections between layers of a neural network. This is done during the training process in order to allow the neural network to adapt to the changes. The resulting weight tensors will have a lot more zeros, which increases the sparsity of the model. With the addition of sparse tensor representations, the memory bandwidth of the kernels can be reduced, and faster kernels can be implemented for the CPU and custom hardware. For those who are interested, Raziel will be talking about pruning and quantization in depth after lunch in the Great American Ballroom.

Revisiting the architecture diagram more closely, the interpreter calls into op kernels and delegates. The op kernels are highly optimized for the ARM NEON instruction set, and the delegates allow you to access accelerators such as the GPU, DSP, and Edge TPU. So let's see how that works. Delegates allow parts of the graph, or the entire graph, to execute on specialized hardware instead of the CPU. In some cases, some operations may not be supported by the accelerator, so the portions of the graph that can be offloaded for acceleration are delegated, and the remaining portions of the graph run on the CPU. However, it's important to note that when the graph is partitioned into too many delegated components, it can slow down graph execution in some cases.

The first delegate we'll discuss is the GPU delegate, which enables faster execution for float models. It's up to seven times faster than the floating point CPU implementation. Currently, on Android, the GPU delegate uses OpenCL when possible and otherwise OpenGL; on iOS, it uses Metal. One trade-off with delegates is the increase in binary size. The GPU delegate adds about 250 kilobytes to the binary size.

The next delegate is the Qualcomm Hexagon DSP delegate. In order to support a greater range of devices, especially mid- to low-tier devices, we have worked with Qualcomm to develop a delegate for the Hexagon chipset. We recommend using the Hexagon delegate on devices running Android O and below, and the NN API delegate, which I'll talk about next, on devices running Android P and above. This delegate accepts integer models, increases the binary size by about two megabytes, and will be launching soon.

Finally, we have the NN API delegate, for the Android Neural Networks API. The NN API delegate supports over 30 ops on Android P and over 90 ops on Android Q. This delegate accepts both float and integer models, and it's built into Android devices and therefore has no binary size increase.

The code for all the delegates is very similar. All you have to do is create the delegate and add it to the TF Lite options for the interpreter when using it. Here's an example with the GPU delegate. And here's an example with the NN API delegate.

The next way to improve performance is to choose a model with a suitable architecture. For many image classification tasks, people generally use Inception. However, when running on device, MobileNet is 15 times faster and nine times smaller. It's therefore important to investigate the trade-off between accuracy and model performance and size. This applies to other applications as well. Finally, you want to ensure that you're benchmarking and validating all of your models.
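As a minimal sketch of magnitude-based pruning with the TensorFlow Model Optimization Toolkit, using a small hypothetical Keras model and random data purely for illustration:

```python
import numpy as np
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Small stand-in model and random training data.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(2, activation="softmax"),
])
x = np.random.rand(640, 20).astype(np.float32)
y = np.random.randint(0, 2, size=(640,))

# Gradually zero out low-magnitude weights during training,
# reaching 50% sparsity by step 100.
schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.5,
    begin_step=0, end_step=100, frequency=10)
pruned = tfmot.sparsity.keras.prune_low_magnitude(model, pruning_schedule=schedule)

pruned.compile(optimizer="adam",
               loss="sparse_categorical_crossentropy",
               metrics=["accuracy"])
# UpdatePruningStep advances the pruning schedule on every training batch.
pruned.fit(x, y, epochs=5,
           callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Strip the pruning wrappers before exporting or converting to TF Lite.
final_model = tfmot.sparsity.keras.strip_pruning(pruned)
```

The zeroed weights still live in dense tensors at this point; the latency and memory bandwidth wins come when sparse tensor representations and sparse kernels, as described above, take advantage of them.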
We offer simple tools to enable this, including per-op profiling, which helps determine which ops are taking the most computation time. This slide shows how to execute the per-op profiling tool through the command line, and this is what the tool outputs when profiling a model. It enables you to narrow down your graph execution and go back and tune performance bottlenecks.

Beyond performance, we have a variety of techniques relating to op coverage. The first allows you to utilize TensorFlow ops that are not natively supported in TF Lite. The second allows you to reduce your binary size if you only want to include a subset of ops.

One of the main issues users face when converting a model from TensorFlow to TensorFlow Lite is unsupported ops. TF Lite has native implementations for a subset of the TensorFlow ops that are optimized for mobile. In order to increase op coverage, we have added a feature called TensorFlow Lite Select, which adds support for many more of the TensorFlow ops. The one trade-off is that it can increase binary size by six megabytes, because we're pulling in the full TensorFlow runtime. This is a code snippet showing how you can use TensorFlow Lite Select. You have to set target_spec.supported_ops to include both built-in and select ops. Built-in ops will be used when possible in order to utilize the optimized kernels, and select ops will be used in all other cases.

On the other hand, for TF Lite developers who deeply care about their binary footprint, we've added a technique that we call selective registration, which only includes the ops that are required by the model. Let's take a look at how this works in code. You create a custom op resolver that you use in place of the standard TF Lite built-in op resolver. Then, in your build file, you specify your model and the custom op resolver that you created. TF Lite will scan over your model and create a registry of the ops contained within it. When you build the interpreter, it'll only include the ops that are required by your model, therefore reducing your overall binary size. This technique is similar to the one used to provide support for custom operations, which are user-provided implementations for ops that we do not support as built-in ops.
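As a minimal Python sketch of the TensorFlow Lite Select configuration described above (the SavedModel directory is a placeholder):

```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
# Prefer the optimized TF Lite built-in kernels, and fall back to the full
# TensorFlow ops (Select) for anything not natively supported. Pulling in the
# TensorFlow runtime is what adds the extra binary size mentioned above.
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,
    tf.lite.OpsSet.SELECT_TF_OPS,
]
tflite_model = converter.convert()
```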
And next, we have Pete talking about microcontrollers.

PETE WARDEN: As you've seen, TensorFlow Lite has had a lot of success on mobile devices like Android and iOS. We're in over three billion devices in production. Oh, I might actually have to switch back to-- let's see-- yes, there we go. What's really interesting, though, is that there are actually over 250 billion microcontrollers out there in the world already. You might not be familiar with them because they tend to hide in plain sight, but these are the chips that you get in your cars and your washing machines, in almost any piece of electronics these days. They are extremely small. They only have maybe tens of kilobytes of RAM and flash to actually work with. They often don't have a proper operating system-- they definitely don't have anything like Linux-- and they are incredibly resource-constrained.

And you might think, OK, I've only got tens of kilobytes of space. What am I going to be able to do with this? A classic example of using microcontrollers is actually-- and you'll have to forgive me if anybody's phone goes off-- "OK Google." That's driven by a model running on an always-on DSP. And the reason it's running on a DSP, even though you have this very powerful ARM CPU sitting there, is that a DSP only uses tiny amounts of battery. And if you want your battery to last for more than an hour or so, you don't want the CPU on all the time. You need something that's going to be able to sit there and sip almost no power.

So the setup that we tend to use for that is you have a small, comparatively low-accuracy model that's always running on this very low-energy DSP, listening out for something that might sound a bit like "OK Google." If it thinks it's heard that, it wakes up the main CPU, which is much more battery hungry, to run an even more elaborate model to double-check. So you're able to get this cascade of deep learning models to try and detect the things that you're interested in. And this is a really, really common pattern. Even though you might not be able to run an incredibly accurate model on a microcontroller or a DSP, if you have this kind of architecture, it's very possible to build really interesting and useful applications and still keep your battery alive.

So we needed a framework that would actually fit into this tens of kilobytes of memory. But we didn't want to lose all of the advantages we get from being part of the TensorFlow Lite ecosystem and the whole TensorFlow ecosystem. So what we've ended up doing is writing an interpreter that fits within just a few kilobytes of memory, but still uses the same APIs, the same kernels, and the same FlatBuffer file format that you use with regular TensorFlow Lite for mobile. So you get all of these advantages, all of the wonderful tooling that Nupur was just talking about, but you get to deploy on these really tiny devices.

[VIDEO PLAYBACK]

- Animation. OK, so now it's ready. And it even gives you instructions. So instead of listening constantly-- which we thought some people don't like because of the privacy side effects-- you have to press button A here. And then you speak into this microphone that I've just plugged into this [INAUDIBLE] port here. It's just a standard microphone. And it will display a video and animation and audio. So let's try it out. I'm going to press A and speak into this mic. Yes. Moo. Bam.

- You did it.

- Live demo.

- So that's what we wanted to show. And it has some feedback on the screen. It shows the version, it shows what we're using. And this is all hardware that we have now-- battery power, [INAUDIBLE] power.

- Yes, yes, yes, yes, yes. This is all battery powered.

[END PLAYBACK]

PETE WARDEN: So what's actually happening is that it plays an animation when she says the word "yes," because it's recognized that. That's actually an example of using TensorFlow Lite for Microcontrollers, which is able to recognize simple words like "yes" or "no." It's really a tutorial on how you can create something very similar to the "OK Google" model that we run on DSPs and phones to recognize short words-- or even do things like recognizing breaking glass or any other audio noises. There's a complete tutorial that you can grab and then deploy on these kinds of microcontrollers. And if you're lucky and you stop by the TensorFlow Lite booth, we might even have a few of these microcontrollers left to give away from Adafruit.
So I know some of you out there in the audience already have that box, but thanks to the generosity of ARM, we've actually been able to hand some of those out. So come by and check that out.

So let's see if I can actually-- yes. The other good thing about this is that you can use it on a whole variety of different microcontrollers. We have an official Arduino library, so if you're using the Arduino IDE, you can grab it immediately. Again, AV-- much harder than AI. Let's see. We'll have the slides available, so you can grab them. But we have a library that you can grab directly through the Arduino IDE, and you just choose it like you would any other library, if you're familiar with that. We also have it available through systems like Mbed, if you're used to that on ARM devices. And through places like SparkFun and Adafruit, you can actually get boards.

And what this does-- you'll have to trust me, because you won't be able to see the LED-- but if I do a W gesture, it lights up the red LED. If I do an O, it lights up the blue LED. Some of you in the front may be able to vouch for me. And then if I do an L-- see if I get this right-- it lights up the yellow LED. As you can tell, I'm not an expert wizard.

We might need to click on the play focus. Let's see this. Fingers crossed. Yay. I'm going to skip past-- oh my god, we have audio. This is amazing. It's a Halloween miracle. Awesome. So, yes, you can see here-- Arduino. A very nice video from them. And they have some great examples out there too. You can just pick up their board and get running in a few minutes. It's pretty cool.

As I was mentioning with the magic wand, here we're doing accelerometer gesture recognition. You can imagine there are all sorts of applications for this. And the key thing here is that this is running on something powered by a coin battery, and it can run on a coin battery for days, or even weeks or months, if we get the power optimization right. So this is really the key to the ubiquitous ambient computing that you might be hearing a lot about.

And what other things can you do with these kinds of MCUs? They are really resource limited, but you can do some great things, like the simple speech recognition we've shown. We have a demo at the booth of person detection using a 250-kilobyte MobileNet model that just detects whether or not there's a person in front of the camera, which is obviously super useful for all sorts of applications. We also have predictive maintenance, which is a really powerful application. If you think about machines in factories, or even something like your own car, you can tell when it's making a funny noise and you might need to take it to the mechanic. Now imagine using machine learning models on all of the billions of machines that are running in factories and industry all around the world, and you can see how powerful that can actually be.

So as we mentioned, we've got these examples out there now as part of TensorFlow Lite that you can run on Arduino, SparkFun, Adafruit, all these kinds of boards: recognizing yes/no, with the ability to retrain using TensorFlow for your own words that you care about. Person detection is really interesting because we've trained it for people, but it will actually also work for a whole bunch of other objects in the COCO data set. So if you want to detect cars instead of people, it's very, very easy to just re-target it for that. And gesture recognition.
We've been able to train it to recognize these kinds of gestures. Obviously, if you have your own things that you want to recognize through accelerometers, that's totally possible to do as well.

So one of the things that's really helped us do this has been our partnership with ARM, who designed all the devices that we've actually been showing today. So maybe the ARM people up front, if you can just give a wave so people can find you. And thank you. They've actually been contributing a lot of code, and this has been a fantastic partnership for us. And stay tuned for lots more where that came from. So that's it for the microcontrollers.

Just to finish up, I want to cover a little bit about where TensorFlow Lite is going in the future. What we hear more than anything is that people want to bring more models to mobile and embedded devices, so more ops and more supported models. They want their models to run faster, so we're continuing to push on performance improvements. They want to see more integration with TensorFlow and things like TensorFlow Hub, and easier usage of all of this, which means better documentation, better examples, and better tutorials. On-device training and personalization is a really, really interesting area where things are progressing. And we also really care about trying to figure out where your performance is going, trying to automate the process of profiling and optimization, and helping you do a better job with your models. To help with all of that, we also have a brand new course on Udacity aimed at TensorFlow Lite. So please check that out.

So that's it from us. Thank you for your patience through all of the technical hiccups. I'm happy to answer any of your questions. I think we're going to be heading over to the booth after this, so we will be there. And you can email us at tflite@tensorflow.org if you have anything that you want to ask us about. So thank you so much. I look forward to chatting.

[APPLAUSE]