  • [MUSIC PLAYING]

  • MATTHIAS GRUNDMANN: Hi, I'm Matthias.

  • And I'm here today to talk about machine learning

  • solutions for live perception.

  • What do we mean by live perception?

  • Basically any kind of machine learning that happens

  • live in the viewfinder, that is, on device in real time

  • with low latency.

  • There's an abundance of applications, for example,

  • virtual makeup and glasses, self-expression effects,

  • like Duo or Snapchat, or gesture control

  • and smart framing on devices like the Nest

  • device or the Portal device.

  • Live perception has various benefits.

  • Because we run all the machine learning on device

  • without any connection to the internet,

  • we also don't need to send any data up

  • to a server, meaning it is inherently privacy conscious.

  • And because we don't need a network connection,

  • our results also come in immediately,

  • which enables new use cases like creating live

  • in the viewfinder.

  • So this sounds great.

  • But what are the technical challenges?

  • Well, if you want to run machine learning in the viewfinder,

  • you really only have a budget of something

  • like 20 milliseconds per frame.

  • Because usually, you also want to run additional stuff

  • like rendering on top of that.

  • And that means you are constrained to low capacity

  • networks.

  • Classical approaches like distillation

  • tend to lead to low quality results.

  • And so you have to build machine learning models

  • with live perception in mind.

  • It's not just the machine learning models,

  • but it's also the infrastructure that you need to use.

  • And they go hand in hand.

  • And by using cutting-edge acceleration

  • techniques like TensorFlow Lite GPU or TensorFlow.js

  • with its WebGL or [INAUDIBLE] back ends,

  • you can achieve real-time performance.

  • In addition, there's a framework called MediaPipe,

  • which allows us to run ML solutions, usually

  • multiple models, directly with GPU support and synchronization

  • support.

  • Speaking of synchronization, in live perception,

  • you do have a natural latency coming

  • from the ML inference as well as the inherent camera pipeline

  • latency.

  • And so this tends to add up to something

  • like 100 to 200 milliseconds, as you can see in the GIF

  • on the right.

  • And a framework like MediaPipe helps you

  • with synchronization and buffering

  • to address those challenges.
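
To make the synchronization idea concrete, here is a minimal conceptual sketch in TypeScript of timestamp-keyed buffering. It is not the MediaPipe API; the class and its methods are hypothetical and only illustrate pairing each ML result with the exact frame it was computed from:

```ts
// Hypothetical sketch (not the MediaPipe API): frames are buffered by
// timestamp, and rendering waits for the ML result carrying the same
// timestamp, so overlays stay aligned despite 100-200 ms of latency.
type Landmarks = number[][];

class FrameSynchronizer {
  private frames = new Map<number, ImageBitmap>();

  addFrame(timestampUs: number, frame: ImageBitmap): void {
    this.frames.set(timestampUs, frame);
  }

  // Called when inference for a given timestamp finishes.
  onResult(
    timestampUs: number,
    landmarks: Landmarks,
    render: (frame: ImageBitmap, landmarks: Landmarks) => void
  ): void {
    const frame = this.frames.get(timestampUs);
    if (frame) {
      render(frame, landmarks); // frame and result leave the pipeline together
      this.frames.delete(timestampUs);
    }
  }
}
```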

  • So in today's talk, I'm going to talk

  • about three solutions, face meshes,

  • hand poses and object detection in 3D.

  • You'll see that we have a couple of recipes for live perception

  • that recur across those solutions

  • and that, number one, we use ML pipelines composed

  • of several smaller models instead of one big model.

  • Our models are tightly coupled together,

  • which allows us to reduce augmentations.

  • And then we also heavily reduce the output layers

  • and we favor regression approaches over heat maps.

  • Let's jump in with our first ML solution.

  • And that is MediaPipe FaceMesh, Google's Mobile 3D FaceMesh,

  • driving applications like YouTube, Duo, and 3P APIs

  • like ML Kit and ARCore.

  • It's quite a difficult problem to try

  • to predict high-fidelity face

  • geometry without the use of dedicated hardware,

  • like a depth sensor.

  • Because we're predicting 468 points,

  • all the classical approaches like heat maps

  • don't really apply.

  • And then we have to achieve high quality.

  • For applications like virtual makeup,

  • we really want to be able to track faithfully

  • the lip contour so that the results look realistic.

  • MediaPipe FaceMesh is accelerated

  • with TensorFlow Lite GPU.

  • And it's available to developers via TF.js.

  • Here are some of the applications

  • that I just mentioned.

  • You see on the left AR ads in YouTube,

  • then self-expression effects in Duo and more expression effects

  • in YouTube Stories.

  • So how does it work?

  • The face mesh is modeled as a MediaPipe pipeline,

  • consisting of a BlazeFace face detector,

  • locating where the face is.

  • And then we crop that location and run on that crop a 3D face mesh

  • network, which then returns 468 landmarks.

  • So the nice thing about coupling it as a pipeline

  • is that we actually don't need to run

  • the detector on every frame.

  • But instead, we can often reuse the location

  • that we computed in the previous frame.

  • So we only need to run BlazeFace on the first frame

  • or when there is a tracking error.
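
Here is a hedged TypeScript sketch of the detector-plus-tracker loop just described. The types, function names, and the 0.5 confidence threshold are illustrative assumptions, not the actual MediaPipe graph or API:

```ts
// Conceptual sketch of the two-model pipeline: run the detector only when
// no region is carried over from the previous frame.
interface Roi { x: number; y: number; width: number; height: number; }
interface MeshResult { landmarks: number[][]; score: number; roi: Roi; }

type Detector = (frame: ImageData) => Roi | null;            // BlazeFace-style detector
type MeshModel = (frame: ImageData, roi: Roi) => MeshResult; // face mesh run on the crop

function makePipeline(detectFace: Detector, predictMesh: MeshModel) {
  let previousRoi: Roi | null = null;

  return function processFrame(frame: ImageData): MeshResult | null {
    const roi = previousRoi ?? detectFace(frame);
    if (!roi) return null;

    const result = predictMesh(frame, roi);

    // If the landmark model is confident, reuse its region next frame;
    // otherwise fall back to the detector (a "tracking error").
    previousRoi = result.score > 0.5 ? result.roi : null;
    return result;
  };
}
```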

  • Let's look a little bit more into the details of BlazeFace.

  • So, as I mentioned, on some frames,

  • we have to run both the detector

  • and the tracking model.

  • And so our detector needs to be fast, blazingly fast.

  • We built a face detector that on some phones

  • locates faces in less than one millisecond.

  • And it's optimized for a set of use cases.

  • When we compare it to a bigger face detector,

  • you see that it has roughly the same precision

  • on selfie use cases.

  • But it's significantly faster by nearly a factor of 4.

  • BlazeFace is available to developers via TF.js, TF Lite

  • and MediaPipe.

  • And developers have been using it for a couple of months.

  • And their response has been extremely positive.

  • They're telling us that it performs better

  • than the default solution that's available on iOS.

  • We also provide a comprehensive ML Fairness evaluation

  • with our release.

  • So you can see how BlazeFace performs

  • across different regions of the Earth,

  • and that the performance differences are very minimal.

  • All right, once we know where the face is,

  • we now run our face mesh network that

  • predicts the 468 3D landmarks.

  • Here, as mentioned earlier, we use regression

  • instead of heat maps.

  • And we coupled it tightly to the detector,

  • reducing augmentations.

  • Now, the interesting bit is the training data.

  • And here, we use a mixture of real world data

  • as well as synthetic data.

  • I'm going to tell you why we're using both for this face mesh.

  • And the reason is really that it's

  • incredibly hard to annotate face meshes from the ground up.

  • And so instead, we're using a hybrid approach, where

  • we start with synthetic data and a small set of 2D

  • landmarks to train an initial model.

  • And then we have annotators patch up

  • predicted bootstrapped results where the face mesh

  • is kind of OK.

  • That then goes to our pool of ground truth data.

  • We retrain the model.

  • And we iterate over that for some time

  • until we build up a data set of 30,000 ground truth images.

  • Now, you see here the annotation process, where annotators

  • take the bootstrapped result. And then

  • they are basically patching up the slight misregistrations

  • over time.

  • One interesting question is, how good are humans at this task?

  • So if you give the same image to several annotators

  • and then you measure the variance,

  • they tend to agree within 2.5% of the interocular distance.

  • That is the distance between your pupils.

  • So that's kind of the gold standard

  • that we want to hit with our ML models.
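
As an illustration of that metric, here is a minimal sketch of computing the mean landmark error normalized by interocular distance. The 2D point format and the pupil index parameters are assumptions for illustration:

```ts
type Point = { x: number; y: number };

const dist = (a: Point, b: Point): number =>
  Math.hypot(a.x - b.x, a.y - b.y);

// Mean landmark error as a fraction of the pupil-to-pupil distance;
// 0.025 corresponds to the ~2.5% human agreement mentioned above.
function normalizedMeanError(
  predicted: Point[],
  groundTruth: Point[],
  leftPupilIndex: number,
  rightPupilIndex: number
): number {
  const iod = dist(groundTruth[leftPupilIndex], groundTruth[rightPupilIndex]);
  const meanError =
    predicted.reduce((sum, p, i) => sum + dist(p, groundTruth[i]), 0) /
    predicted.length;
  return meanError / iod;
}
```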

  • And so we trained models

  • at a variety of capacities.

  • Some are lighter.

  • Some are heavier.

  • All of them run in real time on devices.

  • You can run them accelerated via TF Lite GPU or on TF Lite CPU.

  • The heavier models come actually very close

  • to human performance.

  • And you see the difference is fairly

  • small across the different regions of the Earth.

  • As of this week, the face mesh is also available

  • to developers, and you can run it directly in the browser,

  • locally.

  • You only need to download the model.

  • But otherwise, your video does not

  • have to be streamed up to the cloud,

  • which is extremely privacy conscious.

  • It is running in real time.

  • You see the numbers on the slides.

  • And you can invoke it with just four lines of source code.

  • And we encourage you to try it out.
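
For reference, those roughly four lines look something like the sketch below, based on the TF.js facemesh package as published around the time of this talk; treat the exact package and method names as a snapshot of that release (the package was later renamed):

```ts
import * as facemesh from '@tensorflow-models/facemesh';

async function main() {
  const video = document.querySelector('video')!;
  // Load the model once, then run it per frame on the video element.
  const model = await facemesh.load({ maxFaces: 1 });
  const faces = await model.estimateFaces(video);
  // Each prediction carries the 468 [x, y, z] landmarks in `scaledMesh`.
  console.log(faces.length, faces[0]?.scaledMesh.length);
}
main();
```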

  • Before we talk about the next solution,

  • one more application of the face mesh

  • is that we can also, with an additional network,

  • compute continuous semantics from it

  • that tell you how far your mouth is open,

  • how wide your eyes are open, or whether you are smiling or not.

  • And we can then use these semantic signals

  • to drive virtual avatars like those

  • that you see on the right-hand side,

  • which we recently launched in Duo in order to drive

  • self-expression effects.

  • OK, let's move on to hand meshes.

  • And here, we have MediaPipe hands,

  • which is released in MediaPipe as well as in TF.js,

  • available to developers.

  • Now, hand perception is a uniquely difficult problem.

  • Number one, when you look at hands,

  • you have to cover a huge scale range, right?

  • From cases close to the camera to far away from the camera.

  • In addition, hands tend to be heavily occluded by themselves.

  • For example, if you look at the fingers, but also with

  • respect to each other.

  • Then hands have a myriad of poses,

  • much more difficult than, for example, the gestures

  • that a face is able to make.

  • And then, of course, they have fairly low contrast

  • in comparison to faces, and in particular,

  • also when they are occluding a face.

  • But if you can solve it, there are very interesting use cases

  • that you can enable from gesture recognition to sign language.

  • Now, let's look at how this works.

  • Again, we built this as a MediaPipe pipeline,

  • consisting of a detector, locating

  • where palms or hands are.

  • And then we crop that region.

  • And we run a landmark model that actually then computes

  • the skeleton of a hand.

  • And as before, we don't have to run the detector

  • on every frame, but only really on the first one

  • or if our landmark model indicates a tracking loss.

  • Let's look at the palm detector.

  • So one thing that's really interesting

  • is that we actually didn't build a hand detector.

  • We built a palm detector.

  • Why is that?

  • Well, palms are rigid.

  • Your fingers can be articulated.

  • But your palm tends not to deform that much.

  • In addition, palms are fairly small.

  • So non-max suppression is often still able

  • to pull the locations apart,

  • even in cases of partial occlusion.

  • Palms are roughly square.

  • So you really only need to model one anchor.

  • And you don't need to have additional ones to account

  • for different aspect ratios.

  • And other than that, it's a fairly straightforward

  • SSD-style network, giving you a box and the seven

  • landmarks of a palm.
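
To illustrate the single-aspect-ratio point, here is a small sketch of generating square-only anchors. The stride and anchor size values are made up for illustration and are not the released BlazePalm configuration:

```ts
// One square anchor per grid cell; no extra aspect ratios are needed
// because palms are roughly square.
interface Anchor { cx: number; cy: number; size: number; }

function squareAnchors(inputSize: number, stride: number, size: number): Anchor[] {
  const anchors: Anchor[] = [];
  const cells = Math.floor(inputSize / stride);
  for (let y = 0; y < cells; y++) {
    for (let x = 0; x < cells; x++) {
      anchors.push({ cx: (x + 0.5) * stride, cy: (y + 0.5) * stride, size });
    }
  }
  return anchors;
}

// Example: a 256x256 input with stride 16 yields a 16x16 grid of anchors.
const anchors = squareAnchors(256, 16, 32);
console.log(anchors.length); // 256
```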

  • Let's look at a little bit more of the details here.

  • Compared to BlazeFace, BlazePalm has a slightly more

  • advanced network architecture.

  • So instead of predicting locations only

  • at the lowest resolution of the network,

  • where we have most of the semantic information,

  • we use a feature pyramid network approach to also up-sample

  • the feature maps again in order

  • to locate palms that are smaller in the frame.

  • And because palms tend to be small,

  • you have a big imbalance between foreground and background.

  • And so we use a focal loss from the retina net paper

  • to address this.
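
For reference, the focal loss from the RetinaNet paper down-weights the many easy background examples so that the rare palm examples dominate training; in the paper's notation it is

```latex
\mathrm{FL}(p_t) = -\,\alpha_t \,(1 - p_t)^{\gamma}\,\log(p_t)
```

where p_t is the predicted probability of the true class and gamma (2 in the paper) controls how strongly well-classified examples are suppressed.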

  • If you put both of these ingredients together,

  • you see on the slides there's quite a nice boost in precision

  • that you get.

  • Let's look at the hand landmark model.

  • So after we locate the palms, we then

  • crop that region of the frame.

  • And we predict the 21 3D landmarks that

  • give you the hand skeleton.

  • And here, again, it's interesting

  • what we trained the network on.

  • It is a data set composed of real-world as well

  • as synthetic data.

  • Let's look at the real world data.

  • For this, we asked our annotators

  • to annotate 21 key points.

  • The task here is simpler and not quite as

  • challenging as for face meshes.

  • But because you have low contrast,

  • it's kind of hard to locate exactly where

  • the key point should be placed.

  • And so instead, we're annotating circles and just taking

  • the center, which helps us a little bit in having

  • better ground truth.

  • Now, next is the synthetic data, where we use a commercial model

  • to generate roughly 100,000 high quality

  • renders across a myriad of gestures

  • with HDR lighting for a realistic look.

  • If you take both of these data sets together

  • and combine them, you can train a well-performing hand

  • landmark model that works in 2D as well as in 3D.

  • And you see the performance here on the bottom right.

  • Now, one of the nice things about predicting

  • the individual coordinates of a hand skeleton

  • is that you can build

  • solutions that recognize an arbitrary number of gestures.

  • You don't need to train a network for each gesture

  • individually.

  • But you can really scale this approach.

  • And we invite you to work on this and try it out.

  • Because, now, HandPose is available in TF.js,

  • again, a solution with just four lines of code.

  • It works in real time, on phones, as well as MacBooks.
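
As with the face mesh, here is a minimal sketch of those four lines using the TF.js handpose package from that release (names are a snapshot of that version), plus an illustrative check showing how a gesture can be derived from the 21 landmarks instead of training a network per gesture. The finger-extended heuristic is a deliberately simple assumption, not a shipped gesture classifier:

```ts
import * as handpose from '@tensorflow-models/handpose';

async function main() {
  const video = document.querySelector('video')!;
  const model = await handpose.load();
  const hands = await model.estimateHands(video);
  if (hands.length === 0) return;

  // 21 [x, y, z] landmarks; index 0 = wrist, 5 = index knuckle, 8 = index tip.
  const lm = hands[0].landmarks;
  const dist = (a: number[], b: number[]): number =>
    Math.hypot(a[0] - b[0], a[1] - b[1], a[2] - b[2]);

  // Toy heuristic: the index finger counts as extended if its tip is farther
  // from the wrist than its lower knuckle.
  const indexExtended = dist(lm[8], lm[0]) > dist(lm[5], lm[0]);
  console.log('index finger extended:', indexExtended);
}
main();
```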

  • All right, let's move on to our last solution for today.

  • And that is MediaPipe Objectron, which was announced today.

  • MediaPipe Objectron provides you with object detection in 3D

  • from just a single frame.

  • And what it's doing is basically predicting the 6-DOF

  • pose, translation as well as rotation, of objects

  • and, in addition, their 3-DOF dimensions or extent.

  • It runs in real time on mobile.

  • And it's available in MediaPipe for you to play around with.

  • Now, the challenge in recognizing everyday objects

  • in 3D is the data, of course.

  • We were inspired by self-driving use cases

  • where you have a LiDAR to basically scan street scenes.

  • And here, we are relying on AR sessions.

  • That is, information made available by ARCore or ARKit,

  • which for every frame provides you with a 3D point cloud,

  • the plane geometry, as well as the camera poses.

  • So we built a capture app to collect 12,000 videos and also

  • a unique annotation tool, where annotators

  • see a split view of the frame

  • as well as the ARCore and ARKit session data.

  • They can then position 3D bounding boxes

  • coinciding with objects.

  • And one of the really nice properties

  • here is that, because you do have the ground truth

  • camera pose, you only need to label one frame,

  • and you then have the pose in all of the frames.
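
A hedged sketch of why a single labeled frame is enough, assuming the AR session reports a rigid world-from-camera pose per frame: the box corners are stored once in world coordinates and then re-expressed in each frame's camera coordinates (projection to pixels is omitted). The matrix layout is an assumption for illustration:

```ts
// Row-major 4x4 rigid transform (rotation + translation).
type Mat4 = number[]; // length 16
type Vec3 = [number, number, number];

// Inverse of a rigid transform [R | t] is [R^T | -R^T t].
function invertRigid(m: Mat4): Mat4 {
  const r = [m[0], m[4], m[8], m[1], m[5], m[9], m[2], m[6], m[10]]; // R^T, row-major
  const t = [m[3], m[7], m[11]];
  const nt = [
    -(r[0] * t[0] + r[1] * t[1] + r[2] * t[2]),
    -(r[3] * t[0] + r[4] * t[1] + r[5] * t[2]),
    -(r[6] * t[0] + r[7] * t[1] + r[8] * t[2]),
  ];
  return [r[0], r[1], r[2], nt[0], r[3], r[4], r[5], nt[1], r[6], r[7], r[8], nt[2], 0, 0, 0, 1];
}

function transformPoint(m: Mat4, p: Vec3): Vec3 {
  return [
    m[0] * p[0] + m[1] * p[1] + m[2] * p[2] + m[3],
    m[4] * p[0] + m[5] * p[1] + m[6] * p[2] + m[7],
    m[8] * p[0] + m[9] * p[1] + m[10] * p[2] + m[11],
  ];
}

// The box is annotated once as 8 corners in world coordinates; for every
// other frame, that frame's camera pose alone gives the box in camera space.
function boxInCameraFrame(worldCorners: Vec3[], worldFromCamera: Mat4): Vec3[] {
  const cameraFromWorld = invertRigid(worldFromCamera);
  return worldCorners.map((c) => transformPoint(cameraFromWorld, c));
}
```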

  • The network architecture is a multitask network

  • where, for every input image, we are predicting three things.

  • Number one, the centroid of the object, then the offset

  • from that centroid to locate the eight corners that make up

  • a bounding box, and optionally a segmentation mask

  • that you can use, for example, to model occlusions.
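
A minimal sketch of how a centroid-plus-offsets output can be decoded into the eight corner locations. The tensor layout, the peak-picking, and the grid-space units are assumptions for illustration, not the released Objectron post-processing:

```ts
// heatmap[y][x]: centroid score; offsets[y][x]: 16 values = (dx, dy) per corner,
// both assumed to live in the heatmap's grid coordinates.
function decodeBoxCorners(
  heatmap: number[][],
  offsets: number[][][]
): Array<[number, number]> {
  // Pick the peak of the centroid heatmap.
  let best = { x: 0, y: 0, score: -Infinity };
  heatmap.forEach((row, y) =>
    row.forEach((score, x) => {
      if (score > best.score) best = { x, y, score };
    })
  );

  // Add the regressed per-corner offsets to the centroid location.
  const corners: Array<[number, number]> = [];
  for (let i = 0; i < 8; i++) {
    const dx = offsets[best.y][best.x][2 * i];
    const dy = offsets[best.y][best.x][2 * i + 1];
    corners.push([best.x + dx, best.y + dy]);
  }
  return corners;
}
```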

  • Here are these ingredients or tasks highlighted.

  • So you see the centroid in the jet color map.

  • You see the bounding boxes as well as, optionally,

  • the segmentation mask.

  • Objectron runs on mobile.

  • We released two models, one for localizing chairs,

  • as you can see on the left, and one for shoes.

  • And if you want to play around with all of these solutions,

  • please visit mediapipe.dev, where

  • you can also find other models around segmentation,

  • for example.

  • Thank you.

  • [MUSIC PLAYING]
