[MUSIC PLAYING]

MATTHIAS GRUNDMANN: Hi, I'm Matthias, and I'm here today to talk about machine learning solutions for live perception. What do we mean by live perception? Basically, any kind of machine learning that happens live in the viewfinder, that is, on device, in real time, with low latency. There's an abundance of applications, for example, virtual makeup and glasses, self-expression effects like those in Duo or Snapchat, or gesture control and smart framing on devices like the Nest or Portal devices.

Live perception has various benefits. Because we run all the machine learning on device, without any connection to the internet, we don't need to send any data up to a server, so it is inherently privacy conscious. And because we don't need a network connection, our results come in immediately, which enables new use cases like creating live in the viewfinder.

So this sounds great, but what are the technical challenges? Well, if you want to run machine learning in the viewfinder, you really only have a budget of something like 20 milliseconds per frame, because you usually also want to run additional work, like rendering, on top of that. That means you are constrained to low-capacity networks, and classical approaches like distillation tend to lead to low-quality results. So you have to build machine learning models with live perception in mind. And it's not just the machine learning models; it's also the infrastructure you use, and the two go hand in hand. By using cutting-edge acceleration techniques like TensorFlow Lite GPU, or TensorFlow.js with its WebGL or [INAUDIBLE] backends, you can achieve real-time performance. In addition, there's a framework called MediaPipe, which allows us to run ML solutions, usually multiple models, directly with GPU support and synchronization support.

Speaking of synchronization: in live perception, you have a natural latency coming from the ML inference as well as the inherent camera pipeline latency. This tends to add up to something like 100 to 200 milliseconds, like you see in the GIF on the right. A framework like MediaPipe helps you with synchronization and buffering to address those challenges.

So in today's talk, I'm going to cover three solutions: face meshes, hand poses, and object detection in 3D. You'll see that we have a couple of recipes for live perception that recur across those solutions. Number one, we use ML pipelines composed of several smaller models instead of one big model. Our models are tightly coupled together, which allows us to reduce augmentations. And we heavily reduce the output layers, favoring regression approaches over heat maps.

Let's jump in with our first ML solution, and that is MediaPipe FaceMesh, Google's mobile 3D face mesh, driving applications like YouTube, Duo, and third-party APIs like ML Kit and ARCore. It's quite a difficult problem to predict high-fidelity face geometry without the use of dedicated hardware, like a depth sensor. Because we're predicting 468 points, the classical approaches like heat maps don't really apply. And we have to achieve high quality: for applications like virtual makeup, we really want to track the lip contour faithfully so that the results look realistic. MediaPipe FaceMesh is accelerated with TensorFlow Lite GPU, and it's available to developers via TF.js.
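To make the recipe of chaining several small models concrete, here is a minimal sketch of the detector-plus-tracker pattern that recurs across these solutions and is described in more detail below. The class and method names are placeholders for illustration, not a real MediaPipe or TF.js API:

```typescript
// Hypothetical detector + landmark-model pipeline (all names are placeholders).
interface Region { x: number; y: number; width: number; height: number; }
interface Detector { detect(frame: ImageData): Region | null; }
interface Landmarker {
  predict(frame: ImageData, region: Region): { landmarks: number[][]; confidence: number };
}

class LivePerceptionPipeline {
  private trackedRegion: Region | null = null;

  constructor(private detector: Detector, private landmarker: Landmarker) {}

  // Called once per camera frame; everything here must fit the ~20 ms budget.
  processFrame(frame: ImageData): number[][] | null {
    // Run the detector only when there is no region carried over from the
    // previous frame, i.e. on the first frame or after a tracking failure.
    if (this.trackedRegion === null) {
      this.trackedRegion = this.detector.detect(frame);
      if (this.trackedRegion === null) return null;
    }
    const result = this.landmarker.predict(frame, this.trackedRegion);
    if (result.confidence < 0.5) {
      this.trackedRegion = null; // fall back to detection on the next frame
      return null;
    }
    // In a real pipeline the landmarks would also be used to derive the crop
    // region for the next frame; that step is omitted in this sketch.
    return result.landmarks;
  }
}
```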
Here are some of the applications I just mentioned. You see on the left AR ads in YouTube, then self-expression effects in Duo, and more expression effects in YouTube Stories.

So how does it work? The face mesh is modeled as a MediaPipe pipeline, consisting of a BlazeFace face detector that locates where the face is. We then crop that location and run a 3D face mesh network on it, which returns the 468 landmarks. The nice thing about coupling these as a pipeline is that we don't need to run the detector on every frame. Instead, we can often reuse the location that we computed in the previous frame, so we only need to run BlazeFace on the first frame or when there is a tracking error.

Let's look a little bit more into the details of BlazeFace. As I mentioned, on some frames we have to run both a detector and the tracking model. So our detector needs to be fast, blazingly fast. We built a face detector that, on some phones, locates faces in less than one millisecond, and it's optimized for a specific set of use cases. When we compare it to a bigger face detector, you see it has roughly the same precision on selfie use cases but is significantly faster, by nearly a factor of 4. BlazeFace is available to developers via TF.js, TF Lite, and MediaPipe. Developers have been using it for a couple of months, and their response has been extremely positive. They're telling us that it performs better than the default solution that's available on iOS. We also provide a comprehensive ML fairness evaluation with our release, so you can see how BlazeFace performs across different regions of the Earth, and that the performance differences are very minimal.

All right, once we know where the face is, we run our face mesh network, which predicts the 468 3D landmarks. Here, as mentioned earlier, we use regression instead of heat maps, and we couple it tightly to the detector, reducing augmentations.

Now, the interesting bit is the training data. Here, we use a mixture of real-world data as well as synthetic data, and I'm going to tell you why we're using both for this face mesh. The reason is really that it's incredibly hard to annotate face meshes from the ground up. So instead, we use a hybrid approach: starting with synthetic data and a little bit of 2D landmarks, we train an initial model. Then we have annotators patch up predicted, bootstrapped results where the face mesh is kind of OK. Those then go into our pool of ground truth data, we retrain the model, and we iterate over that for some time until we build up a data set of 30,000 ground truth images. You see here the annotation process, where annotators take the bootstrapped result and basically patch up the slight misregistrations over time.

One interesting question is, how good are humans at this task? If you give the same image to several annotators and measure the variance, they tend to agree within 2.5% of the interocular distance, that is, the distance between your pupils. So that's the gold standard we want to hit with our ML models. For the models, we trained a variety of capacities: some are lighter, some are heavier, and all of them run in real time on devices. You can run them accelerated via TF Lite GPU or on TF Lite CPU. The heavier models come very close to human performance, and you see that the differences are fairly small across the different regions of the Earth.
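Since BlazeFace ships in TF.js, trying it out in the browser is straightforward. Here is a small usage sketch, assuming the @tensorflow-models/blazeface package; check the package documentation for the exact return types:

```typescript
import * as tf from '@tensorflow/tfjs';
import * as blazeface from '@tensorflow-models/blazeface';

async function detectFaces(video: HTMLVideoElement): Promise<void> {
  await tf.ready();                       // wait for a backend (e.g. WebGL) to initialize
  const model = await blazeface.load();   // downloads the BlazeFace model weights
  const faces = await model.estimateFaces(video, /* returnTensors= */ false);
  for (const face of faces) {
    // Each detection carries a bounding box, a confidence score, and a few keypoints.
    console.log('box:', face.topLeft, face.bottomRight, 'score:', face.probability);
  }
}
```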
Now, since this week, the face mesh is also available to developers, and you can run it directly in the browser, locally. You only need to download the model; your video does not have to be streamed up to the cloud, which is extremely privacy conscious. It runs in real time, you see the numbers on the slides, and you can invoke it with just four lines of source code. We encourage you to try it out.

Before we talk about the next solution, one more application of the face mesh: with an additional network, we can also compute continuous semantics from it that tell you, for example, how much your mouth is open, how wide your eyes are open, or whether you smile or not. We can then use these semantic signals to drive virtual avatars, like those you see on the right-hand side that we recently launched in Duo, in order to drive self-expression effects.

OK, let's move on to hand meshes. Here, we have MediaPipe Hands, which is released in MediaPipe as well as in TF.js and is available to developers. Now, hand perception is a uniquely difficult problem. Number one, when you look at hands, you have to cover a huge scale range, from cases close to the camera to far away from the camera. In addition, hands tend to be heavily occluded, both by themselves, for example by the fingers, and with respect to each other. Then, hands have a myriad of poses, much more varied than, for example, the gestures that a face is able to make. And, of course, they have fairly low contrast in comparison to faces, in particular when they are occluding a face. But if you can solve it, there are very interesting use cases you can enable, from gesture recognition to sign language.

Now, let's look at how this works. Again, we built this as a MediaPipe pipeline, consisting of a detector that locates where palms or hands are. We then crop that region and run a landmark model that computes the skeleton of the hand. As before, we don't have to run the detector on every frame, but only on the first one or if our landmark model indicates a tracking loss.

Let's look at the palm detector. One thing that's really interesting is that we actually didn't build a hand detector; we built a palm detector. Why is that? Well, palms are rigid: your fingers can be articulated, but your palm tends not to deform that much. In addition, palms are fairly small, so non-max suppression is often still able to pull the locations apart, even in cases of partial occlusion. And palms are roughly square, so you really only need to model one anchor and don't need additional ones to account for different aspect ratios. Other than that, it's a fairly straightforward SSD-style network, giving you a box and the seven landmarks of a palm.

Let's look at a little bit more of the details here. Compared to BlazeFace, BlazePalm has a slightly more advanced network architecture. Instead of predicting the location only at the lowest resolution of the network, where we have most of the semantic information, we use a feature pyramid network approach to up-sample the layers again in order to locate palms that are smaller in the frame. And because palms tend to be small, you have a big imbalance between foreground and background, so we use the focal loss from the RetinaNet paper to address this.
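For reference, the focal loss idea from the RetinaNet paper can be sketched in a few lines of TF.js-style code. This is a generic binary-classification version with the paper's default hyperparameters, not the actual training code used for BlazePalm:

```typescript
import * as tf from '@tensorflow/tfjs';

// Focal loss down-weights easy (well-classified) examples so the loss is
// dominated by hard ones, which helps with extreme foreground/background imbalance.
function focalLoss(labels: tf.Tensor, logits: tf.Tensor,
                   gamma = 2.0, alpha = 0.25): tf.Tensor {
  const p = tf.sigmoid(logits);
  const isPositive = labels.equal(1);
  const pt = tf.where(isPositive, p, tf.sub(1, p));        // p_t from the paper
  const alphaT = tf.where(isPositive,
                          tf.fill(labels.shape, alpha),
                          tf.fill(labels.shape, 1 - alpha));
  const ce = tf.neg(tf.log(pt.clipByValue(1e-7, 1.0)));    // cross-entropy term
  return tf.mean(alphaT.mul(tf.pow(tf.sub(1, pt), gamma)).mul(ce));
}
```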
If you put both of these ingredients together, you see on the slides that you get quite a nice boost in precision.

Let's look at the hand landmark model. After we locate the palm, we crop that region of the frame and predict the 21 3D landmarks that give you the hand skeleton. Here, again, it's interesting what we trained the network on. The data set is composed of real-world as well as synthetic data. Let's look at the real-world data first. For this, we asked our annotators to annotate 21 key points. The task here is simpler and not quite as challenging as for face meshes, but because of the low contrast, it's hard to pinpoint exactly where a key point should be placed. So instead, we annotate circles and just take the center, which gives us slightly better ground truth. Next is the synthetic data, where we use a commercial model to generate roughly 100,000 high-quality renders across a myriad of gestures, with HDR lighting for a realistic look. If you combine both of these data sets, you can train a well-performing hand landmark model that works in 2D as well as in 3D; you see the performance on the bottom right.

Now, one of the nice things about predicting the individual coordinates of a hand skeleton is that you can build solutions that recognize an arbitrary number of gestures. You don't need to train a network for each gesture individually; you can really scale this problem. And we invite you to work on this and try it out, because HandPose is now available in TF.js, again, a solution with just four lines of code. It works in real time on phones as well as MacBooks.

All right, let's move on to our last solution for today, and that is MediaPipe Objectron, which was announced today. MediaPipe Objectron provides object detection in 3D from just a single frame. What it's doing is basically predicting the 6DoF pose, translation as well as rotation, of objects and, in addition, their 3DoF dimensions, or extent. It runs in real time on mobile, and it's available in MediaPipe for you to play around with.

Now, the challenge in recognizing everyday objects in 3D is, of course, the data. We were inspired by self-driving use cases, where you have a LiDAR to scan street scenes. Here, we rely instead on AR sessions, that is, information made available by ARCore or ARKit, which for every frame provides you with a 3D point cloud, the plane geometry, as well as the camera poses. So we built a capture app to collect 12,000 videos, and also a unique annotation tool where annotators see a split view between the frame and the ARCore or ARKit session data. They can then position 3D bounding boxes so they coincide with objects. One of the really nice properties here is that, because you have the ground truth camera poses, you only need to label one frame, and you have the pose in all of the frames.

The network architecture is a multitask network where, for every input image, we predict three things: number one, the centroid of the object; then the offsets from that centroid to locate the eight corners that make up the bounding box; and, optionally, a segmentation mask that you can use, for example, to model occlusions. Here are these ingredients, or tasks, highlighted: you see the centroid in the jet color map, you see the bounding boxes, as well as, optionally, the segmentation mask. Objectron runs on mobile.
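To illustrate how such multitask outputs could be turned into box vertices in the image, here is a rough sketch of a decoding step. The tensor layout (a centroid heatmap plus per-pixel x/y displacements to the eight corners) is an assumption for illustration, not the exact Objectron post-processing code:

```typescript
interface Point2D { x: number; y: number; }

// Assumed layout: `heatmap` is an H x W grid of centroid likelihoods; `offsets`
// is H x W x 16, holding x/y displacements from the centroid to the 8 box corners.
function decodeBoxCorners(heatmap: number[][], offsets: number[][][]): Point2D[] {
  // 1. Find the peak of the centroid heatmap.
  let best = { x: 0, y: 0, score: -Infinity };
  for (let y = 0; y < heatmap.length; y++) {
    for (let x = 0; x < heatmap[y].length; x++) {
      if (heatmap[y][x] > best.score) best = { x, y, score: heatmap[y][x] };
    }
  }
  // 2. Add the predicted displacements at the peak to get the projected
  //    2D locations of the eight 3D bounding-box corners.
  const corners: Point2D[] = [];
  for (let k = 0; k < 8; k++) {
    corners.push({
      x: best.x + offsets[best.y][best.x][2 * k],
      y: best.y + offsets[best.y][best.x][2 * k + 1],
    });
  }
  return corners;
}
```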
We released two models, one for localizing chairs, as you can see on the left, and one for shoes. If you want to play around with all of these solutions, please visit mediapipe.dev, where you can also find other models, for example around segmentation. Thank you.

[MUSIC PLAYING]