[MUSIC PLAYING]
MATTHIAS GRUNDMANN: Hi, I'm Matthias.
And I'm here today to talk about machine learning
solutions for live perception.
What do we mean by live perception?
Basically any kind of machine learning that happens
live in the viewfinder, that is, on device in real time
with low latency.
There's an abundance of applications, for example,
virtual makeup and glasses, self-expression effects,
like Duo or Snapchat, or gesture control
and smart framing on devices like the Nest
device or the Portal device.
Live perception has various benefits.
Because we run all the machine learning on device
without any connection to the internet,
we also don't need to send any data up
to a server, meaning it is inherently privacy conscious.
And because we don't need a network connection,
our results also come in immediately
and enable new use cases like creating live
in the viewfinder.
So this sounds great.
But what are the technical challenges?
Well, if you want to run machine learning in the viewfinder,
you really only have a budget of something
like 20 milliseconds per frame.
Because usually, you also want to run additional stuff
like rendering on top of that.
And that means you are constrained to low capacity
networks.
Classical approaches like distillation
tend to lead to low quality results.
And so you have to build machine learning models
with live perception in mind.
It's not just the machine learning models,
but it's also the infrastructure that you need to use.
And they go hand in hand.
And by using cutting-edge acceleration
techniques like TensorFlow Lite GPU or TensorFlow.js
with its WebGL or WASM back ends,
you can achieve real time performance.
In addition, there's a framework called MediaPipe,
which allows us to run ML solutions, usually
multiple models, directly with GPU support and synchronization
support.
Speaking of synchronization, in live perception,
you do have a natural latency coming
from that ML inference as well as the inherent camera pipeline
latency.
And so this tends to add up to something
like 100 to 200 milliseconds, as you can see in the GIF
on the right.
And a framework like MediaPipe helps you
with synchronization and buffering
to address those challenges.
So in today's talk, I'm going to talk
about three solutions, face meshes,
hand poses and object detection in 3D.
You'll see that we have a couple of recipes for live perception
that recur across those solutions.
Number one, we use ML pipelines composed
of several smaller models instead of one big model.
Our models are tightly coupled together,
which allows us to reduce augmentations.
And then we also heavily reduce the output layers
and we favor regression approaches over heat maps.
Let's jump in with our first ML solution.
And that is MediaPipe FaceMesh, Google's Mobile 3D FaceMesh,
driving applications like YouTube and Duo, and 3P APIs
like ML Kit and ARCore.
It's quite a difficult problem to try
to predict high-fidelity face
geometry without the use of dedicated hardware,
like a depth sensor.
Because we're predicting 468 points,
all the classical approaches like heat maps
don't really apply.
And then we have to achieve high quality.
For applications like virtual makeup,
we really want to be able to faithfully track
the lip contour so that the results look realistic.
MediaPipe FaceMesh is accelerated
with TensorFlow Lite GPU.
And it's available to developers via TF.js.
Here are some of the applications
that I just mentioned.
You see on the left AR ads in YouTube,
then self-expression effects in Duo, and more expression effects
in YouTube Stories.
So how does it work?
The face mesh is modeled as a MediaPipe pipeline,
consisting of a BlazeFace face detector,
locating where the face is.
And then we crop that location and run the 3D face mesh
network on it, which then returns 468 landmarks.
So the nice thing about coupling it as a pipeline
is that we actually don't need to run
the detector on every frame.
But instead, we can often reuse the location
that we computed in the previous frame.
So we only need to run BlazeFace on the first frame
or when there is a tracking error.
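As a rough sketch of that detector-plus-tracker control flow (the function names and types here are hypothetical placeholders, not the actual MediaPipe API):

```typescript
// Hypothetical sketch of the FaceMesh pipeline logic described above;
// these names are placeholders, not the actual MediaPipe API.
interface Rect { x: number; y: number; width: number; height: number; }
interface FaceResult { landmarks: number[][]; region: Rect; trackingOk: boolean; }

declare function runBlazeFace(frame: ImageData): Rect | null;               // face detector
declare function runFaceMeshModel(frame: ImageData, roi: Rect): FaceResult; // landmark model

let previousRegion: Rect | null = null;

function processFrame(frame: ImageData): FaceResult | null {
  // Run the detector only on the first frame or after a tracking failure;
  // otherwise reuse the face region predicted on the previous frame.
  const roi = previousRegion ?? runBlazeFace(frame);
  if (roi === null) return null; // no face in view

  const result = runFaceMeshModel(frame, roi);
  previousRegion = result.trackingOk ? result.region : null;
  return result;
}
```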
Let's look a little bit more into the details of BlazeFace.
So, as I mentioned, on some frames,
we have to run both the detector
and the tracking model.
And so our detector needs to be fast, blazingly fast.
We built a face detector that on some phones
locates faces in less than one millisecond.
And it's optimized for a set of use cases.
When we compare it to a bigger face detector,
then you see it has roughly the same precision
on selfie use cases.
But it's significantly faster by nearly a factor of 4.
BlazeFace is available to developers via TF.js, TF Lite
and MediaPipe.
And developers have been using it for a couple of months.
And their response has been extremely positive.
They're telling us that it performs better
than the default solution that's available on iOS.
We also provide a comprehensive ML fairness evaluation
with our release.
So you can see how BlazeFace performs
across different regions of the Earth,
and that the performance differences are very minimal.
All right, once we know where the face is,
we now run our face mesh network that
predicts the 468 3D landmarks.
Here, as mentioned earlier, we use regression
instead of heat maps.
And we coupled it tightly to the detector,
reducing augmentations.
Now, the interesting bit is the training data.
And here, we use a mixture of real world data
as well as synthetic data.
I'm going to tell you why we're using both for this face mesh.
And the reason is really that it's
incredibly hard to annotate face meshes from the ground up.
And so instead, what we're using is a hybrid approach, where
we start with synthetic data and a little bit of 2D
landmarks to train an initial model.
And then we have annotators patch up
predicted, bootstrapped results, where the face mesh
is kind of OK.
That then goes to our pool of ground truth data.
We retrain the model.
And we iterate over that for some time
until we build up a data set of 30,000 ground truth images.
Now, you see here the annotation process, where annotators
take the bootstrapped result and then
basically patch up the slight misregistrations
over time.
One interesting question is, how good are humans at this task?
So if you give the same image to several annotators
and then you measure the variance,
they tend to agree within 2.5% of the interocular distance.
That is the distance between your pupils.
So that's kind of the gold standard
that we want to hit with our ML models.
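To make that metric concrete, here is a small sketch of how such a normalized error could be computed (assuming landmarks given as [x, y] points; this is illustrative, not code from the talk):

```typescript
type Point = [number, number];

// Mean landmark error normalized by the interocular distance (IOD).
// A value of 0.025 corresponds to the 2.5% annotator agreement quoted above.
function interocularNormalizedError(pred: Point[], truth: Point[],
                                    leftPupil: Point, rightPupil: Point): number {
  const iod = Math.hypot(leftPupil[0] - rightPupil[0], leftPupil[1] - rightPupil[1]);
  const meanError = pred.reduce((sum, p, i) =>
    sum + Math.hypot(p[0] - truth[i][0], p[1] - truth[i][1]), 0) / pred.length;
  return meanError / iod;
}
```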
And so for the models, we trained
a variety of capacities.
Some are lighter.
Some are heavier.
All of them run in real time on devices.
You can run them accelerated via TF Lite GPU or on TF Lite CPU.
The heavier models come actually very close
to human performance.
And you see the difference is fairly
small across the different regions of the Earth.
As of this week, the face mesh is also available
to developers, and you can run it directly in the browser,
locally.
You only need to download the model.
But otherwise, your video does not
have to be streamed up to the cloud,
which is extremely privacy conscious.
It is running in real time.
You see the numbers on the slides.
And you can invoke it with just four lines of source code.
And we encourage you to try it out.
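Roughly, those few lines look like the following; the package name and API (@tensorflow-models/facemesh with load() and estimateFaces()) are assumed from the TF.js models release around the time of this talk and may have changed since:

```typescript
import * as facemesh from '@tensorflow-models/facemesh';

async function detectFaces(video: HTMLVideoElement) {
  // The model is downloaded once; all inference then runs locally in the browser.
  const model = await facemesh.load();
  // Each returned face contains a dense set of 3D landmarks.
  const faces = await model.estimateFaces(video);
  console.log(faces);
}
```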
Before we talk about the next solution,
one more application of the face mesh
is that we can also, with an additional network,
compute continuous semantics from it
that tell you, for example, how wide your mouth is open,
how wide your eyes are open, or whether you smile or not.
And we can then use these semantic signals
to drive virtual avatars like those
that you see on the right-hand side,
which we recently launched in Duo in order to drive
self-expression effects.
OK, let's move on to hand meshes.
And here, we have MediaPipe Hands,
which is released in MediaPipe as well as in TF.js,
available to developers.
Now, hand perception is a uniquely difficult problem.
Number one, when you look at hands,
you have to cover a huge scale range, right?
From cases close to the camera to far away from the camera.
In addition, hands tend to be heavily occluded by themselves.
For example, if you look at the fingers, but also with
respect to each other.
Then hands have a myriad of poses,
much more difficult than, for example, the gestures
that a face is able to make.
And then, of course, they have fairly low contrast
in comparison to faces, and in particular,
also when they are occluding a face.
But if you can solve it, there are very interesting use cases
that you can enable from gesture recognition to sign language.
Now, let's look at how this works.
Again, we built this as a MediaPipe pipeline,
consisting of a detector, locating
where palms or hands are.
And then we crop that region.
And we run a landmark model that actually then computes
the skeleton of a hand.
And as before, we don't have to run the detector
on every frame, but only really on the first one
or if our landmark model indicates a tracking loss.
Let's look at the palm detector.
So one thing that's really interesting
is that we actually didn't build a hand detector.
We built a palm detector.
Why is that?
Well, palms are rigid.
So your fingers can be articulated.
But your palm tends not to deform that much.
In addition, palms are fairly small.
So when you use non-max suppression,
it is often able to still pull the locations apart,
even in cases of partial occlusions.
Palms are roughly square.
So you really only need to model one anchor.
And you don't need to have additional ones to account
for different aspect ratios.
And other than that, it's a fairly straightforward
SSD-style network, giving you a box and the seven
landmarks of a palm.
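To illustrate that anchor simplification, here is a generic SSD-style anchor generator (a sketch only, not the actual BlazePalm configuration): with roughly square palms, a single 1:1 aspect ratio per grid cell is enough, while a general detector would typically add several ratios per cell.

```typescript
// Generic SSD-style anchor generation (illustrative only).
// Returns anchors as [cx, cy, w, h] in normalized image coordinates.
function generateAnchors(gridSize: number, aspectRatios: number[]): number[][] {
  const anchors: number[][] = [];
  for (let y = 0; y < gridSize; y++) {
    for (let x = 0; x < gridSize; x++) {
      for (const ar of aspectRatios) {
        const w = Math.sqrt(ar) / gridSize;
        const h = 1 / (Math.sqrt(ar) * gridSize);
        anchors.push([(x + 0.5) / gridSize, (y + 0.5) / gridSize, w, h]);
      }
    }
  }
  return anchors;
}

// Palm-style head: one square anchor per cell.
const palmAnchors = generateAnchors(16, [1.0]);
// A generic detector might instead use multiple aspect ratios per cell.
const genericAnchors = generateAnchors(16, [0.5, 1.0, 2.0]);
```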
Let's look at a little bit more of the details here.
Compared to BlazeFace, BlazePalm has a little bit more
advanced network architecture.
So instead of predicting the location
at the lowest resolution of the network
where we have most of the semantic information,
we use a feature pyramid network approach to also up-sample
the image or the layer again in order
to locate palms that are smaller in the frame.
And because palms tend to be small,
you have a big imbalance between foreground and background.
And so we use the focal loss from the RetinaNet paper
to address this.
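For reference, the focal loss from the RetinaNet paper down-weights the many easy background examples so they don't dominate training. With p_t the predicted probability of the ground-truth class, it is

FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t),

where \gamma > 0 shrinks the loss for well-classified examples and \alpha_t balances foreground against background (the RetinaNet paper uses \gamma = 2 and \alpha = 0.25).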
If you put both of these ingredients together,
you see on the slides there's quite a nice boost in precision
that you get.
Let's look at the hand landmark model.
So after we look at the palms, we then
crop that region of the frame.
And we predict the 21 3D landmarks that
give you the hand skeleton.
And here, again, it's interesting
what data we trained the network on.
There is a data set composed of real world as well
as synthetic data.
Let's look at the real world data.
For this, we asked our annotators
to annotate 21 key points.
The task here is simpler and not quite as
challenging as for face meshes.
But because you have low contrast,
it's kind of hard to locate exactly where
the key point should be placed.
And so instead, we're annotating circles and just taking
the center, which helps us a little bit in having
better ground truth.
Now, next is the synthetic data, where we use a commercial model
to generate roughly 100,000 high quality
renders across a myriad of gestures
with HDR lighting for a realistic look.
If you take both of these data sets together
and you combine them, you can train a well-performing hand
landmark model that works in 2D as well as in 3D.
And you see the performance here on the bottom right.
Now, one of the nice things about predicting
the individual coordinates of a hand skeleton
is that you can now build
solutions that recognize an arbitrary number of gestures.
You don't need to train a network for each gesture
individually.
But you can really scale this approach.
And we invite you to work on this and try it out.
Because, now, HandPose is available in TF.js,
again, a solution with just four lines of code.
It works in real time, on phones, as well as MacBooks.
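As with the face mesh, the browser call looks roughly like this; the package name and API (@tensorflow-models/handpose with load() and estimateHands()) reflect the TF.js release at the time of this talk and may have changed since:

```typescript
import * as handpose from '@tensorflow-models/handpose';

async function detectHands(video: HTMLVideoElement) {
  const model = await handpose.load();
  // Each prediction contains 21 3D landmarks describing the hand skeleton.
  const hands = await model.estimateHands(video);
  console.log(hands);
}
```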
All right, let's move on to our last solution for today.
And that is MediaPipe Objectron, which was announced today.
MediaPipe Objectron provides you object detection in 3D
from just a single frame.
And what it's doing is it's basically predicting the 6 DOF
pose, that is, translation as well as rotation, of objects
and, in addition, their 3 DOF dimensions, or extent.
It runs in real time on mobile.
And it's available in MediaPipe for you to play around with.
Now, the challenge in recognizing everyday objects
in 3D is the data, of course.
We were inspired by self-driving use cases
where you have a LiDAR to basically scan street scenes.
And here, we are relying on AR sessions,
that is, information made available by ARCore or ARKit,
which for every frame provides you with a 3D point cloud,
the plane geometry, as well as the camera poses.
So we built a capture app to collect 12,000 videos and also
a unique annotation tool, where now annotators
see a split view between the frame
as well as the ARCore and ARKit session data.
They then can position 3D bounding boxes
coinciding with objects.
And one of the really nice properties
here is that, because you do have the ground truth
camera pose, you only need to label one frame,
and you have the pose in all of the frames.
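To spell out the relationship being used here (the notation is illustrative, not from the talk): if T_obj is the annotated object-to-world transform from the single labeled frame, and T_cam(t) is the AR session's camera-to-world pose at frame t, then the object's pose relative to the camera at frame t is

T(t) = T_cam(t)^{-1} \cdot T_obj,

so a single 3D bounding-box annotation plus the per-frame camera poses yields a label for every frame of the video.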
The network architecture is a multitask network
where, for every input image, we are predicting three things.
Number one, the centroid of the object, then the offsets
from that centroid to locate the eight corners that make up
a bounding box, and, optionally, a segmentation mask
that you can use, for example, to model occlusions.
Here are these ingredients or tasks highlighted.
So you see the centroid in the jet color map.
You see the bounding boxes as well as, optionally,
the segmentation mask.
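As a rough sketch of how such outputs could be decoded into the projected 2D corners of a 3D box, assuming the corner offsets are read out at the centroid peak (the output layout of the released Objectron models may differ):

```typescript
// Illustrative decoding of a centroid heatmap plus per-pixel corner offsets
// into eight projected 2D box corners. Not the actual Objectron post-processing.
function decodeBoxCorners(
  heatmap: Float32Array,   // H*W centroid heatmap, row-major
  offsets: Float32Array,   // H*W*16 values: (dx, dy) to each of 8 corners
  width: number
): [number, number][] {
  // 1. Take the heatmap peak as the object centroid.
  let peak = 0;
  for (let i = 1; i < heatmap.length; i++) {
    if (heatmap[i] > heatmap[peak]) peak = i;
  }
  const cx = peak % width;
  const cy = Math.floor(peak / width);

  // 2. Add the eight (dx, dy) offsets stored at the peak to the centroid.
  const corners: [number, number][] = [];
  for (let k = 0; k < 8; k++) {
    const dx = offsets[peak * 16 + 2 * k];
    const dy = offsets[peak * 16 + 2 * k + 1];
    corners.push([cx + dx, cy + dy]);
  }
  return corners;
}
```

Lifting those projected corners to a metric 3D box would be a separate step, which the talk does not go into.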
Objectron runs on mobile.
We released two models, one for localizing chairs,
as you can see on the left, and one for shoes.
And if you want to play around with all of these solutions,
please visit mediapipe.dev, where
you can also find other models around segmentation,
for example.
Thank you.
[MUSIC PLAYING]