This is all a conspiracy, don't you know that, it's a conspiracy. Yes, yes, yes!
Good evening, my fellow Americans. Fate has ordained that the men who went to the moon to explore in
peace will stay on the moon to rest in peace. That President Nixon video you just watched is
a deep fake. It was created by a team at MIT as an educational tool to highlight how
manipulated videos can spread misinformation - and even rewrite history. Deepfakes have
become a new form of altering reality, and they’re spreading fast. The good ones can
chip away at our ability to discern fact from fiction, testing whether seeing is really
believing. Some have playful intentions, while others can cause serious harm. People have
had high profile examples that they put out that have been very good, and I think that
moved the discussion forward both in terms of, wow, this is what's possible with this
given enough time and resources, and can we actually tell at some point in time, whether
things are real or not? A deep fake doesn't have to be a complete picture of something.
It can be a small part that's just enough to really change the message of the medium.
See I would never say these things, at least not in a public address. But someone else
would. Someone like Jordan Peele.
A deep fake is a video or an audio clip that's been altered
to change the content using deep learning models. The deep part of the deep fake that
you might be accustomed to seeing often relies on a specific machine learning tool. A GAN
is a generative adversarial network and it's a kind of machine learning technique. So in
the case of deep fake generation, you have one system that's trying to create a face,
for example. And then you have an adversary that is designed to detect deep fakes. And
you use these two together to help this first one become very successful at generating faces
that are very hard to detect by using another machine learning technique. And they just
go back and forth. And the better the adversary, the better the producer will be.
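To make that back-and-forth concrete, here is a minimal sketch of a GAN training loop in PyTorch, using 1-D toy data rather than faces. The network sizes, learning rates, and toy target distribution are illustrative assumptions, not details from any system mentioned in the film.

```python
# Minimal GAN sketch: a "producer" (generator) and an "adversary" (discriminator)
# trained against each other on 1-D toy data. All hyperparameters are assumptions.
import torch
import torch.nn as nn

latent_dim = 8

# The producer: turns random noise into fake samples.
generator = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, 1))
# The adversary: scores how likely a sample is to be real.
discriminator = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))

opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(2000):
    real = torch.randn(64, 1) * 0.5 + 3.0          # "real" data: samples near 3.0
    noise = torch.randn(64, latent_dim)
    fake = generator(noise)

    # 1) Train the adversary to tell real from fake.
    d_loss = bce(discriminator(real), torch.ones(64, 1)) + \
             bce(discriminator(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Train the producer to fool the adversary.
    g_loss = bce(discriminator(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

# The better the adversary gets, the harder the producer has to work to fool it;
# the generator's output mean should drift toward the real data's mean (about 3.0).
print(float(generator(torch.randn(256, latent_dim)).mean()))
```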
One of the reasons GANs have become a go-to tool for deep fake creators is the data
revolution that we’re living in. Deep learning has been around a long time; neural networks
were around in the '90s and they disappeared. And what happened was the internet. The internet
is providing enormous amounts of data for people to be able to train these things with
armies of people giving annotations. That allowed these neural networks that really
were starved for data in the '90s to come to their full potential. While this deep learning
technology improves every day, it’s still not perfect. If you try to generate the entire
thing, it looks like a video game. Much worse than a video game in many ways. And so people
have focused on just changing very specific things like a very small part of a face to
make it kind of resemble a celebrity in a still image, or being able to do that and
allow it to go for a few frames in a video.
Deep fakes first started to pop up in 2017,
after a Reddit user posted videos showing famous actresses in porn. Today, these videos
still predominantly target women, but have widened the net to include politicians saying
and doing things that haven't happened. It's a future danger. And a lot of the groups that
we work with are really focused on future dangers and potential dangers and being abreast
of that. One of these interested groups has been DARPA. They sent out a call to researchers
about a program called Media Forensics, also known as MediFor. It's a DARPA project that's
geared towards the analysis of media. And originally it started off as very much focused
on still imagery, and detecting, did someone insert something into this image? Did someone
remove something? It was before deep fakes became prominent. The project focus changed
when this emerged. At SRI International,
Aaron and his team have been working across disciplines to create a multi-pronged approach
for detecting deep fakes. The system they’ve developed is called SAVI. So our group focused
on speech. And in the context of this SAVI program, we worked with people in the artificial
intelligence center who are doing vision. And put our technologies together to collaborate
on coming up with a set of tools that can detect things like, here's the face. Here's
the identity of the face. It's the same person that was earlier in the video. The lips are
moving, okay. And then we use our speech technology and say, "Can we verify that this piece of
audio and this piece of audio came from the same speaker or a different speaker?" And
then put those together as a tool that would say, "If you see a face and you see the lips
moving, the voice should be the same or you wanna flag something."
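As a rough illustration of that cross-check (not SRI's actual implementation), the sketch below assumes hypothetical per-segment outputs from a face tracker and a speaker-verification model, and flags moments where lips are moving but the voice does not verify as the person on screen. The Segment fields and the 0.5 threshold are invented for the example.

```python
# Hypothetical cross-check: if a face is on screen and its lips are moving,
# the audio at that moment should verify as that person's voice.
from dataclasses import dataclass

@dataclass
class Segment:
    start: float            # seconds
    end: float
    face_id: str            # identity assigned by the face tracker
    lips_moving: bool       # from the visual lip-motion detector
    voice_match: float      # speaker-verification score vs. the enrolled face_id voice

def flag_suspicious(segments, threshold=0.5):
    """Return time ranges where lips move but the voice does not verify."""
    flags = []
    for seg in segments:
        if seg.lips_moving and seg.voice_match < threshold:
            flags.append((seg.start, seg.end, seg.face_id))
    return flags

segments = [
    Segment(0.0, 4.0, "speaker_A", True, 0.91),   # consistent
    Segment(4.0, 8.0, "speaker_A", True, 0.12),   # lips move, voice doesn't match -> flag
]
print(flag_suspicious(segments))   # [(4.0, 8.0, 'speaker_A')]
```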
However, there is always a worry that making these detection systems more available could
unintentionally provide deep fake creators with workarounds. If released, the methods meant
to catch altered media could potentially drive the next generation of deep fakes. As a result,
these detection systems have to evolve. Aaron gave us a run-through of how various aspects of
the system's newest iteration work, without giving too much away. This is an explicit lip sync
detection. What we're doing here is we're learning from audio and visual
tracks what the lip movement should be given some speech and vice versa. And we're detecting
when that deviates from what you would expect to see and hear.
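A toy version of that idea might look like the following: embed the audio and the mouth region of each short window into a shared space, then flag windows where the two disagree. The embeddings below are made-up stand-ins for the trained audio and visual encoders, and the threshold is an assumed value, not something from the program.

```python
# Toy lip-sync consistency check on made-up audio/visual embeddings.
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

n_windows, dim = 10, 16
audio_emb = rng.normal(size=(n_windows, dim))
# For most windows the visual embedding tracks the audio; window 7 is "dubbed".
visual_emb = audio_emb + 0.1 * rng.normal(size=(n_windows, dim))
visual_emb[7] = rng.normal(size=dim)

threshold = 0.6   # assumed; a real system would calibrate this on held-out data
for t in range(n_windows):
    score = cosine(audio_emb[t], visual_emb[t])
    if score < threshold:
        print(f"window {t}: lip/audio agreement {score:.2f} -> possible mismatch")
```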
While some techniques can work well on their own, most fare better when combined into a larger detection system. So
in this video you'll see Barack Obama giving a speech about Tom Vilsack, one of his departing
cabinet members. And we're running this live through our system here, which is processing
basically to identify two kinds of information. The top one, where it says natural, is a model
that's detecting whether this is natural or some type of synthesized or generated speech,
essentially a deep fake. The bottom is detecting identity based on voice, so we have a model
of Barack Obama, so it's saying this continues to verify as Obama, and this will continue
like this until now we get Jordan Peele imitating Barack Obama. We're entering an era in which
our enemies can make it look like anyone is saying anything at any point in time. And
that whole section here was Jordan Peele. He’s natural, but he’s not Obama.
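Schematically, the readout Aaron describes amounts to two scores per window of audio: one for whether the speech sounds natural rather than synthesized, and one for whether it verifies against the enrolled speaker model. The numbers below are hand-written stand-ins for real model outputs, chosen only to mirror the Obama/Peele example.

```python
# Schematic per-window readout: naturalness score plus speaker verification score.
windows = [
    {"t": "0:00-0:10", "natural": 0.97, "verifies_as_obama": 0.95},  # real Obama
    {"t": "0:10-0:20", "natural": 0.96, "verifies_as_obama": 0.93},
    {"t": "0:20-0:30", "natural": 0.94, "verifies_as_obama": 0.08},  # Peele: natural voice, wrong identity
]

for w in windows:
    naturalness = "natural speech" if w["natural"] > 0.5 else "synthesized speech"
    identity = "verifies as Obama" if w["verifies_as_obama"] > 0.5 else "NOT Obama"
    print(w["t"], "->", naturalness + ",", identity)
```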
I would say for detection of synthesis or voice conversion, we're in the sub-5% error rate for what I
would call laboratory conditions. And probably in the real world, it would be higher than
that. That's why having these multi-pronged things is really important. However, technology
is only part of the equation. How we as a society respond to these altered pieces of
content is as important. The media tends to focus on the technological aspects of things
rather than the social. The problem is less the deep fakes and more the people who are
very willing to believe something that is probably not well done because it confirms
something that they already believe. Reality becomes an opinion rather than fact. And it
gives you license to misbelieve reality. It's really hard to predict what will happen. You
don't know if this is going to be something that five years from now people actually nail
down or if it's 40 years from now. It's one of those things that is still sort of exciting,
interesting and new and you don't know what the limitations are yet.