TF Lite Lens (TF Dev Summit '20)

[MUSIC PLAYING]

BOYA FANG: Hi, my name is Boya, and I'm a software engineer at Google Lens. I'll be talking today about how TensorFlow, and particularly TF Lite, help us bring the powerful capabilities of Google Lens' computer vision stack on device.

First, what is Google Lens, and what is it capable of? Lens is a mobile app that allows you to search what you see. It takes input from your phone's camera and uses advanced computer vision technology to extract semantic information from image pixels. As an example, you can point Lens at your train ticket, which is in Japanese, and it will automatically translate it and show it to you in English in the live viewfinder. You can also use Lens to calculate the tip at the end of dinner by simply pointing your phone at the receipt. Aside from just searching for answers and providing suggestions, though, Lens also integrates live AR experiences into your phone's viewfinder, using optical motion tracking and an on-device rendering stack.

Just like you saw in the examples, Lens utilizes the full spectrum of computer vision capabilities: image quality enhancements such as de-noising, motion tracking to enable AR experiences, and, in particular, deep-learning models for object detection and semantic understanding.

To give you a quick overview of Lens' computer vision pipeline today: on the mobile client, we select an image from the camera stream to send to the server for processing. On the server side, the query image is processed using a stack of computer vision models to extract text and object information from the pixels. These semantic signals are then used to retrieve search results from our server-side index, which are sent back to the client and displayed to the user.
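To make the client-side step concrete, here is a minimal, illustrative sketch of downscaling and JPEG-encoding a camera frame before uploading it as a query. This is not Lens' actual code; the resize target, JPEG quality, and the upload_query call are assumptions for illustration.

```python
import tensorflow as tf

def prepare_query_image(frame, max_dim=512, jpeg_quality=75):
    """Downscale a camera frame and JPEG-encode it so the query payload
    stays small. (Illustrative sketch; size and quality are assumed.)"""
    shape = tf.cast(tf.shape(frame)[:2], tf.float32)
    scale = tf.minimum(max_dim / tf.reduce_max(shape), 1.0)
    new_size = tf.cast(shape * scale, tf.int32)
    resized = tf.image.resize(frame, new_size)
    resized = tf.cast(tf.clip_by_value(resized, 0, 255), tf.uint8)
    return tf.io.encode_jpeg(resized, quality=jpeg_quality)

# Hypothetical usage: `frame` is an HxWx3 uint8 tensor from the camera.
# payload = prepare_query_image(frame)
# upload_query(payload)  # hypothetical network call to the server
```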

Lens' current computer vision architecture is very powerful, but it has some limitations. Because we send a lower resolution image to the server in order to minimize the payload size of the query, the quality of the computer vision prediction is lowered due to compression artifacts and reduced image detail. Also, the queries are processed on a per-image basis, which can sometimes lead to inconsistencies, especially for visually similar objects. You can see, on the right there, the moth was misidentified by Lens as a chocolate cake, just an example of how this may impact the user.

Finally, Lens aims to provide great answers to all of our users instantly after opening the app. We want Lens to work extremely fast and reliably for all users, regardless of device type and network connectivity. The main bottleneck to achieving this vision is the network round-trip time with the image payload.

To give you a better idea about how network latency impacts us in particular: it goes up significantly with poorer connectivity as well as with payload size. In this graph, you can see latency plotted against payload size, with the blue bars representing a 4G connection and the red a 3G connection. For example, sending a 100 KB image on a 3G network can take up to 2.5 seconds, which is very high from a user experience standpoint.
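As a rough sanity check on that figure, a back-of-the-envelope estimate shows how the payload dominates on a slow link. The throughput and round-trip numbers below are assumptions for illustration, not figures from the talk.

```python
# Illustrative estimate only; throughput and RTT are assumed values.
payload_bits = 100 * 1024 * 8        # 100 KB image
effective_3g_bps = 400_000           # assumed effective 3G throughput (~0.4 Mbps)
round_trip_s = 0.3                   # assumed network round-trip time

upload_s = payload_bits / effective_3g_bps
total_s = upload_s + round_trip_s
print(f"~{total_s:.1f} s before any server processing")  # roughly 2.3 s
```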

In order to achieve our goal of less than one second end-to-end latency for all Lens users, we're exploring moving the server-side computer vision models entirely on device. In this new architecture, we can stop sending pixels to the server by extracting text and object features on the client side. Moving machine learning models on device eliminates the network latency. But this is a significant shift from the way Lens currently works, and implementing this change is complex and challenging.

Some of the main technical challenges are that mobile CPUs are much less powerful than specialized server-side hardware architectures like TPUs. We've had some success porting server models on device by using deep-learning architectures optimized for mobile CPUs, such as MobileNets, in combination with quantization for mobile hardware acceleration. Retraining models from scratch is also very time consuming, but training strategies like transfer learning and distillation significantly reduce model development time by leveraging existing server models to teach a mobile model. Finally, the models themselves need deployment infrastructure that's inherently mobile efficient and manages the trade-off between quality, compute, latency, and power.
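The distillation idea mentioned above can be sketched roughly as follows: a large "teacher" model produces soft labels that a small mobile "student" model is trained to match, alongside the usual hard labels. This is a generic Keras sketch, not Lens' training pipeline; the stand-in models, temperature, and loss weighting are assumptions.

```python
import tensorflow as tf

def distillation_loss(teacher_logits, student_logits, temperature=4.0):
    """Soft-label cross-entropy between teacher and student predictions."""
    soft_targets = tf.nn.softmax(teacher_logits / temperature)
    log_probs = tf.nn.log_softmax(student_logits / temperature)
    return -tf.reduce_mean(tf.reduce_sum(soft_targets * log_probs, axis=-1))

# Stand-in models: a large teacher and a MobileNet-based student (assumed).
teacher = tf.keras.applications.ResNet50(weights=None, classes=10,
                                          classifier_activation=None)
student = tf.keras.applications.MobileNetV2(weights=None, classes=10,
                                             classifier_activation=None)
optimizer = tf.keras.optimizers.Adam(1e-4)

@tf.function
def train_step(images, labels, alpha=0.5):
    teacher_logits = teacher(images, training=False)  # teacher stays frozen
    with tf.GradientTape() as tape:
        student_logits = student(images, training=True)
        hard = tf.keras.losses.sparse_categorical_crossentropy(
            labels, student_logits, from_logits=True)
        loss = alpha * tf.reduce_mean(hard) + (1 - alpha) * distillation_loss(
            teacher_logits, student_logits)
    grads = tape.gradient(loss, student.trainable_variables)
    optimizer.apply_gradients(zip(grads, student.trainable_variables))
    return loss
```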

We have used TF Lite, in combination with MediaPipe as an executor framework, to deploy and optimize our ML pipeline for mobile devices.
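For reference, running a converted model with the TF Lite interpreter looks roughly like this in Python (on device you would typically use the Java/Kotlin, Swift, or C++ APIs instead); the model path and the dummy input are placeholders.

```python
import numpy as np
import tensorflow as tf

# Load a converted model; "model.tflite" is a placeholder path.
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Dummy input matching the model's expected shape and dtype.
dummy = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], dummy)
interpreter.invoke()
prediction = interpreter.get_tensor(output_details[0]["index"])
```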

Our high-level developer workflow to port a server model on device is to, first, pick a mobile-friendly architecture, such as a MobileNet; then train the model using a TensorFlow training pipeline, distilling from a server model, and evaluate its performance using TensorFlow's evaluation tools; and, finally, save the trained model at a checkpoint that you like and convert it to the TF Lite format in order to deploy it on mobile. Here's an example of how easy it is, using TensorFlow's command-line tools, to convert the saved model to TF Lite.
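The slide itself isn't reproduced in the subtitles, but the conversion step the speaker refers to looks roughly like the following with TensorFlow's Python converter API (the tflite_convert command-line tool wraps the same functionality); the paths are placeholders.

```python
import tensorflow as tf

# Convert a SavedModel to TF Lite; "saved_model_dir" is a placeholder path.
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # optional post-training quantization
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```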

Switching gears a little bit, let's look at an example of how Lens uses on-device computer vision to bring helpful suggestions instantly to the user. We can use on-device ML to determine whether the user's camera is pointed at something that Lens can help the user with. You can see here, in this video, that when the user points at a block of text, a suggestion chip is shown. When pressed, it brings the user to Lens, which then allows them to select the text and use it to search the web.

To enable these kinds of experiences on device, multiple visual signals are required. To generate these signals, Lens uses a cascade of text, barcode, and visual detection models, implemented as a directed acyclic graph in which some stages can run in parallel. The raw text, barcode, and object-detection signals are further processed using various on-device annotators and higher-level semantic models, such as fine-grained classifiers and embedders. This graph-based framework of models allows Lens to understand the scene's content as well as the user's intent.
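As an illustration of what such a cascade might look like, independent detectors can run in parallel and feed downstream annotators. This is a generic sketch, not Lens' actual graph; all model callables and stages here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def run_cascade(frame, text_detector, barcode_detector, object_detector,
                fine_classifier):
    """Run independent detectors in parallel, then feed their outputs to
    higher-level models. All model callables are hypothetical stand-ins."""
    with ThreadPoolExecutor() as pool:
        # Independent nodes of the DAG: they share no input other than the frame.
        text_f = pool.submit(text_detector, frame)
        barcode_f = pool.submit(barcode_detector, frame)
        objects_f = pool.submit(object_detector, frame)
        text, barcodes, objects = (text_f.result(), barcode_f.result(),
                                   objects_f.result())

    # Downstream node: fine-grained classification of each detected object.
    labels = [fine_classifier(frame, box) for box in objects]
    return {"text": text, "barcodes": barcodes,
            "objects": objects, "labels": labels}
```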

To further optimize for low latency, Lens on device uses a set of inexpensive ML models which can be run within a few milliseconds on every camera frame. These perform functions like frame selection and coarse classification, optimizing for latency and compute by carefully selecting when to run the rest of the ML pipeline.
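That gating pattern can be sketched as follows: a cheap per-frame model decides whether the heavier cascade is worth running at all. The threshold and model callables are hypothetical; this is not Lens' implementation.

```python
def process_frame(frame, coarse_classifier, heavy_pipeline, threshold=0.6):
    """Run a cheap coarse classifier on every frame; only trigger the
    expensive cascade when something actionable is likely present.
    (Hypothetical gating sketch; threshold and models are assumptions.)"""
    score = coarse_classifier(frame)  # e.g. probability that text/objects are present
    if score < threshold:
        return None                   # skip the heavy models to save latency and power
    return heavy_pipeline(frame)      # e.g. the detection cascade sketched earlier
```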

In summary, Lens can improve the user experience for all our users by moving computer vision on device. TF Lite and other TensorFlow tools are critical in enabling this vision. We can rely on cascading multiple models in order to scale this vision to many device types and tackle reliability and latency.

You, too, can add computer vision to your mobile product. First, you can try Lens to get some inspiration for what you could do, and then check out the pre-trained mobile models that TensorFlow publishes. You can also follow something like the MediaPipe tutorial to help you build your own custom cascade.
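As a starting point for the pre-trained mobile models mentioned above, a mobile-friendly classifier such as MobileNetV2 can be loaded directly from Keras and later converted to TF Lite as shown earlier; the image path below is a placeholder.

```python
import numpy as np
import tensorflow as tf

# Load a pre-trained, mobile-friendly image classifier.
model = tf.keras.applications.MobileNetV2(weights="imagenet")

# Preprocess a single image ("photo.jpg" is a placeholder path).
img = tf.keras.utils.load_img("photo.jpg", target_size=(224, 224))
x = tf.keras.applications.mobilenet_v2.preprocess_input(
    np.expand_dims(tf.keras.utils.img_to_array(img), axis=0))

preds = model.predict(x)
print(tf.keras.applications.mobilenet_v2.decode_predictions(preds, top=3)[0])
```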

Or you could build, deploy, and integrate ML models into your mobile app using something like ML Kit for Firebase. Thank you.

[MUSIC PLAYING]
