[MUSIC PLAYING]

BOYA FANG: Hi, my name is Boya, and I'm a software engineer at Google Lens. I'll be talking today about how TensorFlow, and particularly TF Lite, helps us bring the powerful capabilities of Google Lens' computer vision stack on device.

First, what is Google Lens, and what is it capable of? Lens is a mobile app that allows you to search what you see. It takes input from your phone's camera and uses advanced computer vision technology to extract semantic information from image pixels. As an example, you can point Lens at your train ticket, which is in Japanese. It will automatically translate it and show it to you in English in the live viewfinder. You can also use Lens to calculate the tip at the end of dinner by simply pointing your phone at the receipt. Aside from searching for answers and providing suggestions, Lens also integrates live AR experiences into your phone's viewfinder, using optical motion tracking and an on-device rendering stack. As you saw in those examples, Lens uses the full spectrum of computer vision capabilities: image quality enhancements such as de-noising, motion tracking to enable AR experiences, and, in particular, deep-learning models for object detection and semantic understanding.

To give you a quick overview of Lens' computer vision pipeline today: on the mobile client, we select an image from the camera stream to send to the server for processing. On the server side, the query image is processed by a stack of computer vision models that extract text and object information from the pixels. These semantic signals are then used to retrieve search results from our server-side index, which are sent back to the client and displayed to the user.

Lens' current computer vision architecture is very powerful, but it has some limitations. Because we send a lower-resolution image to the server to minimize the payload size of the query, the quality of the computer vision predictions suffers from compression artifacts and reduced image detail. Also, queries are processed on a per-image basis, which can sometimes lead to inconsistencies, especially for visually similar objects. You can see on the right that the moth was misidentified by Lens as a chocolate cake, just one example of how this may impact the user.

Finally, Lens aims to provide great answers to all of our users instantly after opening the app. We want Lens to work extremely fast and reliably for all users, regardless of device type and network connectivity. The main bottleneck to achieving this vision is the network round-trip time with the image payload. To give you a better idea of how network latency impacts us in particular, it goes up significantly with poorer connectivity as well as payload size. In this graph, latency is plotted against payload size, with the blue bars representing a 4G connection and the red bars a 3G connection. For example, sending a 100 KB image on the 3G network can take up to 2.5 seconds, which is very high from a user-experience standpoint. To achieve our goal of less than one second end-to-end latency for all Lens users, we're exploring moving server-side computer vision models entirely on device. In this new architecture, we can stop sending pixels to the server by extracting text and object features on the client side.
Moving machine learning models on device eliminates the network latency. But this is a significant shift from the way Lens currently works, and implementing the change is complex and challenging. There are several main technical challenges. First, mobile CPUs are much less powerful than specialized server-side hardware architectures like TPUs. We've had some success porting server models on device by using deep-learning architectures optimized for mobile CPUs, such as MobileNets, in combination with quantization for mobile hardware acceleration. Retraining models from scratch is also very time consuming, but training strategies like transfer learning and distillation significantly reduce model development time by leveraging existing server models to teach a mobile model (see the distillation sketch below). Finally, the models themselves need deployment infrastructure that is inherently mobile efficient and manages the trade-off between quality, compute, latency, and power. We have used TF Lite in combination with MediaPipe as an executor framework to deploy and optimize our ML pipeline for mobile devices.

Our high-level developer workflow to port a server model on device is to, first, pick a mobile-friendly architecture, such as a MobileNet; then train the model using the TensorFlow training pipeline, distilling from a server model; then evaluate its performance using TensorFlow's evaluation tools; and, finally, save the trained model at a checkpoint you like and convert it to the TF Lite format to deploy it on mobile. Here's an example of how easy it is to convert the saved model to TF Lite using TensorFlow's command line tools (see the conversion sketch below).

Switching gears a little bit, let's look at an example of how Lens uses on-device computer vision to bring helpful suggestions instantly to the user. We can use on-device ML to determine whether the user's camera is pointed at something Lens can help with. You can see here, in this video, that when the user points at a block of text, a suggestion chip is shown. When pressed, it brings the user to Lens, which then allows them to select the text and use it to search the web.

To enable these kinds of experiences on device, multiple visual signals are required. To generate these signals, Lens uses a cascade of text, barcode, and visual detection models, implemented as a directed acyclic graph in which some models can run in parallel. The raw text, barcode, and object-detection signals are further processed using various on-device annotators and higher-level semantic models such as fine-grained classifiers and embedders. This graph-based framework of models allows Lens to understand the scene's content as well as the user's intent. To further optimize for low latency, Lens on device uses a set of inexpensive ML models which can run within a few milliseconds on every camera frame. These perform functions like frame selection and coarse classification, optimizing latency and compute by carefully selecting when to run the rest of the ML pipeline (see the gating sketch below).

In summary, Lens can improve the experience of all our users by moving computer vision on device. TF Lite and other TensorFlow tools are critical in enabling this vision. We can rely on cascading multiple models to scale this vision to many device types and tackle reliability and latency. You, too, can add computer vision to your mobile product.
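As a rough illustration of the distillation idea mentioned above, here is a minimal sketch in TensorFlow. Everything in it is an assumption rather than Lens code: the "server_teacher" export path, the MobileNetV2 student, the label-space size, the temperature, and the train_ds dataset are placeholders.

```python
import tensorflow as tf

# Minimal distillation sketch (illustrative only; names and values are
# placeholders, not Lens code). A frozen "teacher" model supervises a small
# MobileNet "student" by matching softened class probabilities.

NUM_CLASSES = 100    # hypothetical label space
TEMPERATURE = 4.0    # softens the probability distributions

teacher = tf.keras.models.load_model("server_teacher")  # assumed export path
teacher.trainable = False

# Student: a mobile-friendly architecture, trained from scratch here.
student = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), weights=None,
    classes=NUM_CLASSES, classifier_activation=None)  # raw logits out

optimizer = tf.keras.optimizers.Adam(1e-4)
kld = tf.keras.losses.KLDivergence()

@tf.function
def distill_step(images):
    # Teacher predictions act as soft targets; we assume the teacher also
    # outputs logits over the same label space.
    soft_targets = tf.nn.softmax(teacher(images, training=False) / TEMPERATURE)
    with tf.GradientTape() as tape:
        student_probs = tf.nn.softmax(student(images, training=True) / TEMPERATURE)
        loss = kld(soft_targets, student_probs)
    grads = tape.gradient(loss, student.trainable_variables)
    optimizer.apply_gradients(zip(grads, student.trainable_variables))
    return loss

# Training loop over an assumed tf.data pipeline of images:
# for images in train_ds:
#     distill_step(images)
```

In practice the distillation loss is usually combined with a standard cross-entropy term on labeled data, but the soft-target term alone shows the idea.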
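The conversion example shown on the slide is not reproduced in the transcript, so here is a hedged sketch of what that last step of the workflow can look like with standard TensorFlow tooling; the file paths are placeholders.

```python
import tensorflow as tf

# Equivalent command-line form (paths are placeholders):
#   tflite_convert --saved_model_dir=/tmp/mobile_model \
#                  --output_file=/tmp/mobile_model.tflite

# Python API: load the exported SavedModel and convert it to TF Lite.
converter = tf.lite.TFLiteConverter.from_saved_model("/tmp/mobile_model")

# Optional post-training quantization, one common way to prepare a model for
# mobile hardware acceleration (the talk mentions quantization but not the
# exact scheme Lens uses).
converter.optimizations = [tf.lite.Optimize.DEFAULT]

tflite_model = converter.convert()
with open("/tmp/mobile_model.tflite", "wb") as f:
    f.write(tflite_model)
```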
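And here is a small sketch of the gating pattern described above: an inexpensive classifier runs on every camera frame, and a heavier model runs only when the gate fires. The model file names, preprocessing, and threshold are illustrative assumptions, not Lens internals.

```python
import numpy as np
import tensorflow as tf

# Sketch of the "cheap gate first, heavy models later" pattern (illustrative
# only: model file names, preprocessing, and the threshold are assumptions).

gate = tf.lite.Interpreter(model_path="coarse_classifier.tflite")
gate.allocate_tensors()

detector = tf.lite.Interpreter(model_path="text_detector.tflite")
detector.allocate_tensors()

def run(interpreter, frame):
    """Feeds one preprocessed frame through a TF Lite interpreter and returns
    its first output tensor."""
    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]
    interpreter.set_tensor(inp["index"], frame[np.newaxis, ...].astype(np.float32))
    interpreter.invoke()
    return interpreter.get_tensor(out["index"])

def process_frame(frame, threshold=0.5):
    """Runs the inexpensive gate on every camera frame; runs the heavier
    detector only when the gate says the frame is worth analyzing.
    `frame` is assumed to already match the models' expected input shape."""
    score = float(run(gate, frame).flatten()[0])
    if score < threshold:
        return None  # skip the expensive part of the cascade
    return run(detector, frame)
```

A production graph would fan out to text, barcode, and object branches in parallel after the gate, as the talk describes; the single detector here just keeps the sketch short.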
First, you can try Lens to get some inspiration for what you could do, and then you can check out the pre-trained mobile models that TensorFlow publishes. You can also follow something like the MediaPipe tutorial to help you build your own custom cascade. Or you could build, deploy, and integrate ML models into your mobile app using something like ML Kit for Firebase. Thank you. [MUSIC PLAYING]