[MUSIC PLAYING]
BOYA FANG: Hi, my name is Boya, and I'm a software engineer
at Google Lens.
And I'll be talking today about how
TensorFlow and, particularly, TF Lite
help us bring the powerful capabilities of Google Lens'
computer vision stack on device.
First, what is Google Lens?
And what is it capable of?
Lens is a mobile app that allows you to search what you see.
It takes input from your phone's camera
and uses advanced computer vision technology in order
to extract semantic information from image pixels.
As an example, you can point Lens
to your train ticket, which is in Japanese.
It will automatically translate it and show it to you
in English in the live viewfinder.
You can also use Lens to calculate the tip
at the end of dinner by simply pointing
your phone at the receipt.
Aside from just searching for answers
and providing suggestions, though, Lens
also integrates live AR experiences
into your phone's viewfinder, using optical motion tracking
and an on-device rendering stack.
Just like you saw in the examples,
Lens utilizes the full spectrum of computer vision
capabilities, starting with image quality enhancements
such as de-noising and motion tracking
to enable AR experiences, and, in particular,
deep-learning models for object detection
and semantic understanding.
To give you a quick overview of Lens' computer
vision pipeline today: on the mobile client,
we select an image from the camera stream
to send to the server for processing.
On the server side, the query image
then gets processed using a stack of computer vision models
in order to extract text and object
information from the pixels.
These semantic signals are then used
to retrieve search results from our server-side index, which
then gets sent back to the client
and displayed to the user.
Lens' current computer vision architecture is very powerful,
but it has some limitations.
Because we send a lower resolution image to the server
in order to minimize the payload size of the query,
the quality of the computer vision prediction
is lowered due to the compression artifacts
and reduced image detail.
Also, the queries are processed on a per-image basis,
which can sometimes lead to inconsistencies,
especially for visually similar objects.
You can see, on the right there, the moth
was misidentified by Lens as a chocolate
cake, just an example of how this may impact the user.
Finally, Lens aims to provide great answers
to all of our users instantly after opening the app.
We want Lens to work extremely fast and reliably
for all users, regardless of device type
and network connectivity.
The main bottleneck to achieving this vision
is the network round-trip time with the image payload.
To give you a better idea of how network latency
impacts us in particular: it goes up significantly
with poorer connectivity as well as with payload size.
In this graph, you can see latency
plotted against payload size with the blue bars representing
a 4G connection and the red a 3G connection.
For example, sending a 100 KB image on the 3G network
can take up to 2.5 seconds, which is very high from a user
experience standpoint.
In order to achieve our goal of less than one
second end-to-end latency for all Lens' users,
we're exploring moving server-side computer vision
models entirely on device.
In this new architecture, we can stop sending pixels
to the server by extracting text and object
features on the client side.
Moving machine learning models on device
eliminates the network latency.
But this is a significant shift from the way Lens currently
works, and implementing this change
is complex and challenging.
Some of the main technical challenges
are that mobile CPUs are much less powerful
than specialized, server-side, hardware architectures like
TPUs.
We've had some success porting server models
on device using deep-learning architectures optimized
for mobile CPUs, such as MobileNets,
in combination with quantization for mobile hardware
acceleration.
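As a rough sketch of what that quantization step can look like (not Lens' actual conversion code), here is post-training full-integer quantization with the TF Lite converter, assuming a hypothetical MobileNet-style SavedModel in a directory called mobilenet_saved_model and placeholder calibration data:

```python
import numpy as np
import tensorflow as tf

def representative_dataset():
    # Calibration inputs drive the choice of integer ranges.
    # Random data is only a placeholder; real calibration would
    # use representative camera frames.
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("mobilenet_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Restrict ops to int8 so the model can run on integer-only mobile accelerators.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

tflite_quant_model = converter.convert()
with open("mobilenet_int8.tflite", "wb") as f:
    f.write(tflite_quant_model)
```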
Retraining models from scratch is also very time consuming,
but training strategies like transfer learning
and distillation significantly reduce model development time
by leveraging existing server models to teach a mobile model.
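To make the distillation idea concrete, here is a minimal sketch of a distillation training step, assuming hypothetical Keras models teacher (a frozen server model) and student (a mobile model) that output logits over the same classes; this is an illustration, not Lens' training pipeline:

```python
import tensorflow as tf

def distillation_loss(teacher_logits, student_logits, temperature=4.0):
    # Soften both distributions and push the student toward the
    # teacher's predictions (knowledge distillation).
    teacher_probs = tf.nn.softmax(teacher_logits / temperature)
    student_log_probs = tf.nn.log_softmax(student_logits / temperature)
    return -tf.reduce_mean(
        tf.reduce_sum(teacher_probs * student_log_probs, axis=-1)
    ) * (temperature ** 2)

def train_step(images, labels, teacher, student, optimizer, alpha=0.5):
    teacher_logits = teacher(images, training=False)  # frozen server model
    with tf.GradientTape() as tape:
        student_logits = student(images, training=True)
        hard_loss = tf.reduce_mean(
            tf.keras.losses.sparse_categorical_crossentropy(
                labels, student_logits, from_logits=True))
        # Blend the ground-truth loss with the teacher-matching loss.
        loss = alpha * hard_loss + (1 - alpha) * distillation_loss(
            teacher_logits, student_logits)
    grads = tape.gradient(loss, student.trainable_variables)
    optimizer.apply_gradients(zip(grads, student.trainable_variables))
    return loss
```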
Finally, the models themselves need deployment infrastructure
that's inherently mobile-efficient and manages
the trade-off between quality, compute, latency, and power.
We have used TF Lite in combination
with MediaPipe as an executor framework
in order to deploy and optimize our ML
pipeline for mobile devices.
Our high-level developer workflow
to port a server model on device is to,
first, pick a mobile-friendly architecture,
such as a MobileNet, then train the model using the TensorFlow
training pipeline, distilling from a server model,
and then evaluate the performance using
TensorFlow's evaluation tools.
Finally, we save the trained model at a checkpoint
we like and convert it to the TF Lite format
in order to deploy it on mobile.
Here's an example of how easy it is
to use TensorFlow's command-line tools to convert
the saved model to TF Lite.
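The talk refers to TensorFlow's command-line tools (such as tflite_convert); here is a minimal equivalent sketch using the Python converter API instead, with a hypothetical SavedModel export directory:

```python
import tensorflow as tf

# Minimal SavedModel -> TF Lite conversion, equivalent to running the
# tflite_convert command-line tool. "lens_detector/saved_model" is a
# hypothetical export directory, not a real Lens model.
converter = tf.lite.TFLiteConverter.from_saved_model("lens_detector/saved_model")
tflite_model = converter.convert()

with open("lens_detector.tflite", "wb") as f:
    f.write(tflite_model)
```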
Switching gears a little bit, let's look
at an example of how Lens uses on-device computer vision
to bring helpful suggestions instantly to the user.
We can use on-device ML in order to determine
if the user's camera is pointed at something that Lens
can help the user with.
You can see here, in this video, that,
when the user points at a block of text,
a suggestion chip is shown.
When pressed, it brings the user to Lens,
which then allows them to select the text
and use it to search the web.
In order to enable these kinds of experiences on device,
multiple visual signals are required.
In order to generate these signals,
Lens uses a cascade of text, barcode, and visual detection
models, implemented as a directed acyclic graph
in which some models can run in parallel.
The raw text, barcode, and object-detection signals
are further processed using various on-device annotators
and higher level semantic models such as fine-grained
classifiers and embedders.
This graph-based framework of models
allows Lens to understand the scene's content as well
as the user's intent.
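As a rough illustration of that graph-based cascade (hypothetical stage functions, not MediaPipe's actual graph format or Lens' real models), here is a sketch in which text, barcode, and object detectors run in parallel and feed a downstream annotator:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stage functions; in a real pipeline each would wrap a TF Lite model.
def detect_text(frame):
    return []  # placeholder: list of detected text regions

def detect_barcodes(frame):
    return []  # placeholder: list of decoded barcodes

def detect_objects(frame):
    return []  # placeholder: list of detected objects

def annotate(text, barcodes, objects):
    # Downstream annotators and fine-grained classifiers would consume
    # the raw detection signals here.
    return {"text": text, "barcodes": barcodes, "objects": objects}

def run_cascade(frame):
    # The three detectors don't depend on each other, so they can run in
    # parallel; the annotator depends on all three: a small directed
    # acyclic graph of model stages.
    with ThreadPoolExecutor(max_workers=3) as pool:
        text_f = pool.submit(detect_text, frame)
        barcode_f = pool.submit(detect_barcodes, frame)
        object_f = pool.submit(detect_objects, frame)
        return annotate(text_f.result(), barcode_f.result(), object_f.result())
```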
To further help optimize for low latency, Lens on device
uses a set of inexpensive ML models,
which can be run within a few milliseconds on every camera
frame.
These perform functions like frame selection and coarse
classification in order to optimize for latency
and compute by carefully selecting when to run
the rest of the ML pipeline.
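Here is a minimal sketch of that gating pattern with the TF Lite Python interpreter; the model file gate.tflite, its single-score output, and the threshold are all hypothetical:

```python
import numpy as np
import tensorflow as tf

# Lightweight "gate" model that runs on every camera frame; only when its
# score is high enough do we run the heavier part of the ML pipeline.
gate = tf.lite.Interpreter(model_path="gate.tflite")  # hypothetical model
gate.allocate_tensors()
gate_input = gate.get_input_details()[0]
gate_output = gate.get_output_details()[0]

def should_run_full_pipeline(frame, threshold=0.6):
    # frame: array already resized and cast to the gate model's input
    # shape and dtype (batch dimension added below).
    gate.set_tensor(gate_input["index"], np.expand_dims(frame, axis=0))
    gate.invoke()
    score = gate.get_tensor(gate_output["index"])[0][0]
    return score >= threshold
```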
In summary, Lens can help improve
the user experience of all our users
by moving computer vision on device.
TF Lite and other TensorFlow tools
are critical in enabling this vision.
We can rely on cascading multiple models in order
to scale this vision to many device types
and tackle reliability and latency.
You, too, can add computer vision to your mobile product.
First, you can try Lens to get some inspiration for what
you could do, and then you can check out
the pre-trained mobile models that TensorFlow publishes.
You can also follow something like the MediaPipe
tutorial to help you build your own custom cascade.
Or you could build, deploy, and integrate ML models
into your mobile app using something like ML Kit for Firebase.
Thank you.
[MUSIC PLAYING]