  • What's going on, everybody.

  • And welcome to a video about neural networks that speak.

  • Every voice in this video is generated from a neural network.

  • I know people will claim they could tell, that they could hear the difference.

  • There's absolutely no way I would be fooled by this.

  • I have some of the best ears, not too big, not too small.

  • Just right.

  • Give it long enough and you might realize something is wrong.

  • But I think this voice is pretty convincing.

  • It's not a question of if there will come a time when we can't trust audio or video.

  • It is.

  • Right now.

  • We live in a time where almost all of our information

  • comes from text, audio and video, and all of this can be faked.

  • It's entirely possible that many things we've already seen or heard were faked like this.

  • I wonder how people will use this technology.

  • It seems like it would mostly be used for bad things.

  • What can we even do about this?

  • The reaction of many will be to just ban things like this, and this is not necessarily the answer.

  • The truth is, impersonating others for nefarious purposes is already illegal, and actors from other countries do not care about our laws.

  • They will add this technology to their arsenal of cyber attack weaponry.

  • Citizens must remain vigilant about where they get their information and must remain aware that the threat of fake audio and video is here.

  • Now we live in an era where most people get their information online from many sources, including social media.

  • While this has allowed for more freedom of the press, which underpins our democracy, it has also allowed for foreign agents to penetrate our flow of information.

  • That is why tonight I am proposing that we build an Internet wall. That is a horrible... If we look at China or Russia, they do not have this problem.

  • I don't think you understand how this wall would stop the flow of fake information. This might be the dumbest idea I have ever heard. This wall, it would be a beautiful wall, the most beautiful wall you've ever seen. Absolutely not.

  • People would come to visit just to see my Internet wall.

  • No, I estimate that the money made from tourists coming to see the Internet wall would pay for the Internet wall itself.

  • Whether we like it or not, the issue here is just like any fake news.

  • Often the truth takes a long time to get out.

  • And some people will just never see the correction.

  • With the last presidential election, we saw just how much fake information could impact us, by clogging up information.

  • And that was just with fake text.

  • I wonder how much more destructive fake audio and video will be. All right, and back to reality.

  • Or is it? Just kidding... or am I?

  • Anyway, what I, uh, would like to do now is show you guys how you can get started with this, what's involved, and hopefully simplify it a decent amount.

  • I had no idea how complicated sound could actually be, with spectrograms and things like Fourier transforms and all this stuff, eh?

  • So I'm not gonna be getting too deep into the weeds, just kind of show you how you could do what I've done and kind of what I found. I am by no means done with doing fake audio.

  • And the next thing I'd like to do is fake video.

  • Then combine the two.

  • So, um, there's still a lot I'm trying to do, and I just I don't really feel like doing the line by line thing.

  • But I thought I would share with you guys what I found so far.

  • So with that, the model that I ended up going with is this DCTTS model.

  • Um, I used this code here, but it's based on the following paper, which, take note, was written in late 2017.

  • So almost two years ago, actually. But you know, the history of doing text to speech on computers is, and has always been, pretty much: someone goes into a studio and records hundreds of hours of audio, and then we use things like digital signal processing, or DSP as I'll probably refer to it from now on,

  • uh, to do things like working on the pauses between words and sentences, and then do things like the speech pattern and stuff like that.

  • It's all these, like, tools and hard-coded, rule-based things for text to speech.

  • In most cases, even when doing text to speech with neural networks, historically, people have applied more rule-based things to the output.

  • Now, obviously, the dream would be to do text to speech without any of that just simply throw in some examples, get output.

  • That is just awesome.

  • So that was the goal.

  • And also side goal is to figure out How can we do this?

  • And how quickly can we do this?

  • And could we just do this with, like, anything on like YouTube?

  • Like, could we just take just about any YouTube video and mimic a voice on that video?

  • So with that in mind, this is the model I ended up with.

  • But I did try other models, including Tacotron and Tacotron v2, and I found this one to be the best, which is kind of curious. And one of the things that you might notice, if you've been reading and not listening, or both, is this says deep convolutional.

  • That's right.

  • It's a convolutional neural network.

  • There is not a recurrent cell to be seen, which is also interesting because traditionally you would think of a task like this to go from text to speech.

  • Text is a sequence.

  • Speech is also; it's not like a single scalar value, right?

  • It's a sequence, so most of the time you would think, um, I would use a sequence to sequence model, which is a recurrent neural network.

  • Nope.

  • This is using deep convolutional neural networks with attention.

  • Now, this is guided attention, which I will explain in a moment.

  • But interestingly enough, it's turning out over the last couple years to be the case that deep convolutional neural networks are actually outperforming recurrent neural networks at these sequence to sequence types of tasks.

  • They're training faster, and they're getting better results.

  • And that's awesome.

  • I'm not sure that I would be ready to say recurrent neural networks are dead.

  • We just need a better way to do recurrent neural networks because the biggest thing is the amount of data that's required and how slow they are to train.

  • So if we could speed them up, we might see good things.

  • So, um, for example, I recall using a recurrent neural network on, like, MNIST data.

  • So it's not like recurrent neural networks aren't just as, um, applicable as a convolutional neural network can be, so I wouldn't necessarily write them off completely.

  • Um, don't forget, Neural networks themselves got written off completely not long ago, So anyway, um, yeah, Cool.

  • So what's guided attention?

  • Guided attention is this.

  • It's basically the assumption that everything is gonna be linear.

  • So the input to the output will always be linear, as opposed to consider a machine translation task.

  • Where, let's say, you're going from English to Spanish, and you've got "big house," which needs to translate to "casa grande."

  • Okay, well, the adjective to describe that noun in English comes before the noun, but in Spanish comes after the noun.

  • So it's not a linear relationship.

  • In all cases, other words might be, but some are gonna be flipped around.

  • And so attention is there to actually help with this sequence to sequence stuff, which was previously mostly used for recurrent neural networks.

  • But now we're using it in convolutional neural networks as well.

  • And, it turns out, with convolutional neural networks plus attention, we can do the exact same thing with the added benefit of doing it way faster.
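To make the "guided" part concrete, here's a small numpy sketch of the guided-attention penalty from the DCTTS paper; the matrix sizes and the `g` value here are illustrative, not necessarily what the repo uses:

```python
import numpy as np

def guided_attention_weights(N, T, g=0.2):
    """Penalty matrix from the DCTTS paper: near zero where the text
    position n/N lines up with the audio-frame position t/T, rising
    toward 1 the farther attention strays from that diagonal."""
    n = np.arange(N)[:, None] / N   # text positions, shape (N, 1)
    t = np.arange(T)[None, :] / T   # mel-frame positions, shape (1, T)
    return 1.0 - np.exp(-((n - t) ** 2) / (2 * g ** 2))

def guided_attention_loss(A, g=0.2):
    """Mean of the attention matrix weighted by the penalty: small
    when attention hugs the diagonal, larger the more it wanders."""
    W = guided_attention_weights(*A.shape, g=g)
    return float(np.mean(A * W))

# A perfectly diagonal alignment is free; a flipped (anti-diagonal)
# alignment, fine for translation but wrong for TTS, gets penalized.
diagonal = guided_attention_loss(np.eye(50))
flipped = guided_attention_loss(np.fliplr(np.eye(50)))
```

That extra loss term is what nudges the attention toward the linear text-to-audio alignment he describes, which is why it forms so much faster than unguided attention.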

  • The other nice benefit that I found is the end result is just better, so really cool.

  • So, for example, I just want to play a quick example of, um, let me see where it is, yes, of a text to speech that is pretty close to the typical ones I would find by training a mediocre sample size of, say, 600 samples.

  • So I'm just gonna play one real quick.

  • Hi.

  • Is lying programming like that?

  • Okay, um, you probably might not even understand what is being said there, but I'll play it one more time.

  • It's saying "Python is the one and only true programming language."

  • So, yeah, that doesn't work well. So that's Tacotron on 500 samples.

  • One thing to note: Tacotron requires more like 100-plus hours of sample audio and weeks to months to train.

  • So I did not do Tacotron justice.

  • Also, there's now Tacotron v2.

  • So, um, just understand that that's just an example.

  • But I wanted to show you, because every time I've ever played with text to speech, that's about what it sounded like.

  • Like, in terms of text to speech with neural networks.

  • So there's something there you can tell.

  • The neural network is kind of grasping onto things, but it's not.

  • It's just not there.

  • It's just simply not there.

  • So then I trained this deep convolutional neural network with this guy's code, um, and I'll try to put a link in the description.

  • If I forget someone remind me.

  • You can also just google, like, "DCTTS GitHub."

  • And I'm confident you'll find this page.

  • Um, anyways, I trained this model, and this one is on the LJ Speech dataset, which is here, um, which is a female voice.

  • I think it I don't even know.

  • I don't, I don't know if it's Kate Winslet's voice, because he's got these other sample audios: you've got Kate Winslet, Nick Offerman, and then I'm not sure whose voice this one is, but it's a singular voice.

  • You need to train with just one voice.

  • Otherwise you'll get a mashing of voices, which maybe that's what you want, but probably not.

  • Anyway, I used the LJ Speech dataset and trained for, I want to say, about three days. The first one, so this model has two networks, which I'll explain more in a moment, the first one took about one day.

  • The second one took about two days, and this is the result that I got.

  • Now, this voice is quite good, in my opinion.

  • And it was trained with many hours of sample audio from an audiobook.

  • Okay, so that is still a little robotic.

  • You can, if you really pay close attention, you can kind of hear these weird little fluctuations. Now, somebody who knows more than I do...

  • I really don't know much about digital signal processing and all that stuff.

  • Someone who's really good at it could fix that voice, using something really specific, probably, for that voice, and fix it.

  • But this is straight from the DCTTS model output, from here on GitHub.

  • And that's pretty good.

  • I mean, that's clearly a person we would accept listening to that audio.

  • Um, I think I was just, I was pretty staggered by those results, because I've never seen a text to speech model produce such results besides Tacotron, which is a big pain to train.

  • Now, that dataset consists of 13,000 samples, and it's about 24 hours total of audio.

  • It's a lot of audio.

  • So my initial thought was: can I do this with YouTube, and how few samples can I use?

  • So then I started.

  • I tried to use my own videos, because I felt like if I'm gonna fake someone's voice, I probably should do my own first and then do other people's.

  • It's kind of like if you're a police officer and you wanna have a Taser, they make you tase yourself first.

  • It's kind of the same thing.

  • Um, so, so my hope was that I could take my YouTube videos and then also take the caption files for the YouTube videos.

  • So I tried that.

  • But the problem is, the caption file doesn't quite like the time stamps on the caption file aren't quite right.

  • And then the caption file also often eliminates things like, um and, uh, other, like, stutters and so on.

  • So it's just because of the guided attention.

  • Um, this is such an integral part here, because the assumption is everything will be linear.

  • Everything will line up.

  • We're assuming it with 100% certainty, so we need it to actually line up.

  • So it has to be perfect.

  • Unfortunately, the caption file was a no-go. So then I'm thinking, well, maybe captions are that way because, um, because it's better for the person reading them, so you can read faster if we don't include the stutters and stuff.

  • So then it's like, Okay, let's try using, like Google's speech to text.

  • I tried that; similar problems.

  • It would chop things.

  • It would leave things out.

  • It would fix stutters.

  • It would get rid of the ums and uhs.

  • So then I'm like, well, I guess I'll manually transcribe this stuff. And I don't think, probably, anyone watching this video has manually transcribed audio.

  • But let me just say it is the worst thing.

  • It is the most painful thing.

  • I can't think of a thing more painful than transcribing audio, at least unless it's, like, really clean audio.

  • So in, in my case, there's a lot of stutters.

  • There's "this" and "uh," lots of "um"s, and lots of times I, like, repeat a word twice or three times, and sometimes it's really hard to hear, and you have to go back and replay it. What I found personally is, like, I just naturally will nix it when people say "uh" or "um"; like, a lot of times, I just, like, don't even hear it.

  • So I'm sitting here typing it out, and then usually I'd play it back one more time or two more times after it was fully typed out just to make sure.

  • And it was just staggering how many times I would miss things, even, like, once or twice I would miss things when I'm, like, really trying to pay attention to it.

  • So, um, really interesting.

  • And also just painful, just painful to, uh, to, ah, transcribe things.

  • So then I'm like, oh my gosh, because, you know, the LJ Speech dataset is 13,000 samples.

  • I was finding I could do about 100 samples an hour of transcribing, and I could do, max, about, like, 200 a day.

  • Beyond that, it was like just killer just too, too brutal.

  • And so I do the math.

  • That's two months to get a new voice. And a truck outside my house.

  • Lovely, lovely, right when you're recording.

  • Like, every time I record, big trucks have to come by. But okay, so after manually transcribing about 600 samples, the result I got was: "Python is one only short of programming language."

  • So that one's not too bad.

  • There's a couple of small issues.

  • One is it sounds like I'm underwater.

  • The other one is You can hear these like little flickers in the background.

  • And I think that's the model actually trying to recreate the sound of the keyboard, which is cool but also not necessary.

  • Don't really want that in the audio unless I want to add it after the fact.

  • So, um, it sounds better, but clearly we need more work and, um, it's just not great.

  • It's not as good as that original voice.

  • So then I'm thinking to myself, Man, do I really like Am I really gonna have to do thousands of samples?

  • Because generally, you know, your initial results are going to be pretty good.

  • And then, you know, it's the law of diminishing returns.

  • So the more samples you get, the less, you know, improvement you're going to see per new sample.

  • So, so now I'm just like, man, what do I do?

  • And then I think, Well, what about transfer learning?

  • So I decide.

  • Well, this model sounds good.

  • Now, this voice is quite good, in my opinion, and it was trained with many hours of sample audio from an audio book.

  • So could I take that that model right with the two networks, do a transfer learn.

  • Could I get there?

  • So So my next question was Okay, well, like time to look back at the model and figure out how everything's working.

  • So let me explain briefly how this model actually works.

  • So this DCTTS model, basically the way it works, it's got two models.

  • The first model is a Text2Mel, as they call it, and this is basically the conversion of text to a mel spectrogram.

  • And that's also where your attention is.

  • Then the second model, this is your SSRN, or spectrogram super-resolution network, and that one just basically kind of clarifies everything else.
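Just so the Text2Mel target is concrete: a mel spectrogram is a magnitude spectrogram collapsed onto mel-spaced triangular filters. Here's a rough from-scratch numpy sketch; the parameter values are illustrative, not this repo's exact settings, and real pipelines (librosa, the repo's own utils) handle the details more carefully:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_spectrogram(wav, sr=22050, n_fft=1024, hop=256, n_mels=80):
    """Frame the signal, take magnitude FFTs, then collapse each
    spectrum onto n_mels mel-spaced triangular filters."""
    frames = [wav[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(wav) - n_fft, hop)]
    mag = np.abs(np.fft.rfft(np.array(frames), axis=1))  # (T, n_fft//2+1)

    # triangular mel filterbank covering 0 .. sr/2
    mel_pts = mel_to_hz(np.linspace(0, hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    return mag @ fb.T   # (frames, n_mels) mel spectrogram
```

The SSRN then learns to go the other direction, from this compact mel representation back up to a full-resolution spectrogram, which is plausibly why it trains so quickly.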

  • So my question was: what if I take that voice, the female voice, feed in, say, a few hundred examples of my own voice, and then do a few thousand more steps of learning, effectively just doing a quick transfer there?

  • Could that work?

  • And, um, what I found was first, I had to figure out how many steps to go in.

  • At least what I found was, it takes somewhere between 2,000 and 15,000 steps on model one.

  • The Text2Mel takes about that many steps, and then the SSRN really takes, like, 1,000 to 4,000 steps.

  • Really, no more is required.

  • I was kind of baffled by how quickly that one learns.

  • I'm not really sure why that one learns so quickly.

  • I guess, because all that model is really doing is learning how to take a mel spectrogram and convert it to this, like, super resolution.

  • So maybe that's it.

  • But I will say you have to train it.

  • You can't not train it.

  • Otherwise the voice just will sound wrong.

  • So what I found was, to train, so, for example, before I show you some pictures, um, to train, let's say,

  • 1,000 to 15,000 steps,

  • that's about, like, 3 to 45 minutes.

  • It was, like, about 1,000 steps per three minutes.

  • So that's for Model one.

  • And then to do model two, you know, to do 2,000 to 4,000 steps, it would take about 30 minutes.

  • Okay, let's say so.

  • All in all about an hour and 15 minutes to train a new model.
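The timing math above, spelled out (the rates are the rough ones observed here, not guarantees):

```python
# Text2Mel trains at roughly 1,000 steps per three minutes, and the
# SSRN fine-tune takes about 30 minutes on its own.
def minutes_for(steps, steps_per_3_min=1000):
    """Minutes of Text2Mel training for a given step count."""
    return steps / steps_per_3_min * 3

text2mel_minutes = minutes_for(15_000)  # upper end of the 2k-15k range
ssrn_minutes = 30                       # observed, roughly
total = text2mel_minutes + ssrn_minutes
print(f"~{total:.0f} minutes")          # ~75 minutes: an hour and 15
```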

  • Now, the question was, of course, how few samples could we do this with?

  • So, um, first let me play the voice I got, um I think this one's an early transfer.

  • I'm just gonna play it.

  • I'm not I'm a human like you.

  • Yeah.

  • This is an early train.

  • It sounds like, I don't actually have this properly labeled, "I a human like you," and it sounds to me like it's a transfer of model one.

  • So model one, the Text2Mel, but not the SSRN.

  • But then, after significant training on both sides of this, we get a voice like this, which I think is pretty good.

  • So that sounds pretty darn close to me.

  • We get a voice like this, which I think is pretty good.

  • There's a few pronunciation issues, but again, the pronunciation issues will go away.

  • The more samples that you have.

  • So in this case, that voice is off of about 200 samples.

  • So not many at all.

  • And to, um to transcribe that many samples it would take maybe two hours.

  • So what I did was I Basically, we went from taking, like, over two months to create a new voice.

  • Brought that down to about 3.5 hours for anyone's voice.

  • Given that you have, um, approximately 200 decent samples.

  • So, and again, each one is anywhere between, um, two and 10 seconds.

  • I think the average is usually around, like, six or seven or something like that.

  • So about 15 minutes of audio.

  • So you just need 15 minutes of audio.
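The back-of-the-envelope math behind those figures; note the 4.5-second average is my assumption to make the 15-minute number work out (at the stated 6-7 second average it's closer to 20 minutes):

```python
# Clips run 2-10 seconds each; assume ~4.5 s on average.
samples = 200
avg_clip_seconds = 4.5
audio_minutes = samples * avg_clip_seconds / 60   # 15.0 minutes of audio

# Transcribing at ~100 samples per hour, plus ~1.25 hours of training,
# is where the "about 3.5 hours per voice" figure comes from.
transcribe_hours = samples / 100
train_hours = 1.25
total_hours = transcribe_hours + train_hours      # 3.25, call it ~3.5
```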

  • You can try with less, but then you get weird things.

  • So, for example, let me play.

  • Let me play this one for you.

  • Hello?

  • Minding this and they're still mine, not the machine.

  • I call that one crisis of identity because you can even hear.

  • It's like flipping back and forth between female and male voice, which is kind of funny.

  • So, um okay, So as I trained, let me just show you kind of what I'm looking for.

  • So I'm just gonna grab "sentdex three" just as an example.

  • Different voices seem to require different amounts of time.

  • And if you were sitting here physically, you could kind of, more manually, like, if you see the attention going in the wrong direction, you could kind of fix for it.

  • So one of the really cool things, I don't know if some of you guys have seen attention graphs.

  • So this is at 1,000 steps, 2,000 steps, you can see it starts, at least, to, like, form, um, that line there, and then about 4,000 steps.

  • There's some serious attention here.

  • And just for the record, like without this guided attention, this takes a lot longer to happen.

  • So the fact that this is already happening by, like, 4-5K, like, these are already looking, these are already pretty darn good-looking attention graphs, like, dang. So, um, now going backwards.

  • I'm just going to scroll us back to this.

  • The female voice model stopped here.

  • This is where I stopped it.

  • Now, you could train this, this one first a little more and then do the transfer, like, there's so many more things for me to trial and error here.

  • And I've already trial-and-errored a lot of things, but this is where I stopped.

  • So then 1000 steps with a new voice.

  • You can see the attention kind of gets a little confused.

  • The resolution gets quite poor and then kind of loses attention towards the end there.

  • Then, by 187, it's also not really that great.

  • 188, at least it's sort of improving.

  • 189 is improving, and then we keep going.

  • I forget where I stop on this one.

  • If it's 199, this one's, well, not very good, right?

  • Losing attention towards the end there.

  • And basically, I just keep training until I see a pretty good one.

  • Like, so, for example, this one's pretty good. Like, look how yellow that is, man. Like, yellow is good, right?

  • That's more closely aligned.

  • Basically, um, that one's bad resolution.

  • And 199, I'm pretty sure, is where I stopped. Yeah, so I could keep going on this voice, but you can see it kind of loses attention towards the end, but that's still pretty good.

  • And that's the voice you've been hearing as my fake voice throughout this video.

  • So hey, wait.

  • Um, that's just like an example.

  • So you're just, you're just looking for these graphs to kind of tighten up and turn yellow and make it to the end without losing that line along the way, basically, to simplify it.
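If you wanted to automate that eyeballing, one made-up way to score it is to measure how far each frame's strongest attention sits from a perfectly linear alignment (this scorer is my own sketch, not part of the repo):

```python
import numpy as np

def attention_score(A):
    """How 'tight and diagonal' an (N, T) attention matrix looks:
    mean distance of each frame's attended text position from the
    ideal linear alignment.  0.0 is perfect; larger is worse."""
    N, T = A.shape
    focus = A.argmax(axis=0) / max(N - 1, 1)  # attended position per frame
    ideal = np.arange(T) / max(T - 1, 1)      # perfectly linear alignment
    return float(np.mean(np.abs(focus - ideal)))

# A clean diagonal scores 0; attention that collapses onto one text
# position (a common failure late in the utterance) scores much worse.
good = np.eye(40)
collapsed = np.zeros((40, 40))
collapsed[5, :] = 1.0
```

Something like this could, in principle, also drive the "keep training until a pretty good one appears" loop automatically.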

  • So cool.

  • So besides that, I think that's really everything I really wanted to show you guys now moving forward.

  • There might come a time where I do more of a line by line tutorial, but like I said, I'm just not done.

  • And so if you guys want to take part in doing it, it's really simple.

  • Download this person's code.

  • Okay, Um and then, uh, basically, you go into the data set.

  • So this LJ Speech dataset, if I click on this metadata.csv, that contains all your transcriptions.

  • So if I click on that, that's not what I wanted to open it with.

  • You already have that.

  • I don't have it open.

  • Um, let's just open it in a text editor.

  • You can see here, it's just separated by a bar, the pipe character, here.

  • But also take note.

  • There's two transcriptions.

  • I'm looking for the bar, which I am not seeing.

  • Where's that bar?

  • Okay, I couldn't find it on the first one, but there's the bar.

  • Here is the bar for the number six, basically.

  • And so you could do, you could make your own.

  • So, for example, like, when I was making my own audio, what I did was, um,

  • I made my own by simply, I created a program.

  • It would start with a label, like this, then a bar,

  • show me the text it wanted me to say, I would say the text, record it with some PyAudio, um, and then output this file, and then it

  • output the waves in this directory, and these are all your waves that match.

  • So the file name here matches up with the name here.

  • So if you want to make your own, that's how you do it. Now, with the YouTube videos, like for the Elon Musk voice and Donald Trump and, uh, Obama and whoever else I did, um, for those, unfortunately, I wish the captions worked. I almost kind of want to, like, throw away guided attention and just try regular attention, because it would be so much more cool to just automate the entire process.

  • And I still think I can automate it.

  • But the way that we did that was, um, download the audio split by silence try to make the clips between two and 10 seconds long, split by silence, of course.

  • And then I would play it and then transcribe it that way.

  • So it's still very, very manual, and that's like the most.

  • That's the longest part of the task, and that's the only part of the task that really requires human input.

  • Um, so it'd be cool to automate that process, for sure.
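The download-and-split step of that pipeline can be sketched; here's a made-up, pure-numpy version of splitting by silence (the thresholds are invented, and real tools like pydub's split_on_silence do this more robustly):

```python
import numpy as np

def split_by_silence(wav, sr, thresh=0.01, min_silence_s=0.3,
                     min_len_s=2.0, max_len_s=10.0):
    """Cut a waveform into clips wherever the signal stays below
    `thresh` for at least `min_silence_s`, keeping only clips inside
    the 2-10 second window the model wants."""
    win = int(min_silence_s * sr)   # samples of silence that trigger a cut
    clips, start, run = [], 0, 0
    for i, sample in enumerate(wav):
        run = run + 1 if abs(sample) < thresh else 0
        if run == win:              # silence just got long enough: cut
            clip = wav[start:i - win + 1]
            if min_len_s * sr <= len(clip) <= max_len_s * sr:
                clips.append(clip)
            start = i + 1
    tail = wav[start:]
    if min_len_s * sr <= len(tail) <= max_len_s * sr:
        clips.append(tail)
    return clips
```

Each resulting clip would then still need a human to transcribe it, which, as described above, remains the slow part.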

  • Um, but anyways, I did find out that you could definitely, um, fake someone's voice with 15 minutes of audio, which is bonkers.

  • I mean, that's just crazy.

  • It's so it's a very small amount of audio that you need, and the fact that you can do it in about three hours is also pretty crazy.

  • So, moving forward, what I would like to do is continue looking to see if I can make these voices any better. So, the fewer samples that you have,

  • it starts to pronounce certain things weird. Like, for example, if you try to make some of these other voices, as I transcribed them, the term, like, "neural network" never came up.

  • So when you would have them say "neural," sometimes it's "new-rall" or something weird, it just didn't sound right.

  • So one of my goals is trying to figure out: how, how can we get these models to be just a little better?

  • And then maybe, maybe even use a base voice or, like, have different base voices. So if you start with, like, a, ah, British base voice, male and female, a US base voice, male and female, and then start from there, I think you'd probably have better results, because keep in mind, all I did was create a bunch of male voices from a female voice, a female British voice.

  • And I didn't do any British people.

  • So, um, yeah, a lot, a lot of things that we could definitely test moving forward.

  • But anyway, I thought this was really cool.

  • Obviously, there's all kinds of ramifications here.

  • We, you know, I think most people are completely unaware that this technology exists.

  • Um, now, nowadays, if you get a phone call, you can spoof the phone number, first of all, but also we can now spoof voices.

  • So you have a phone call from a family member.

  • You have no idea.

  • Right?

  • We all know that strange emails with suspicious links in them from family members, we probably shouldn't trust them. But if you get a phone call from a family member, you probably believe it's actually them.

  • But it might not be.

  • So that's pretty crazy. Also with, like, fake news and stuff like that, like in the 2016 presidential election,

  • I mean, it was so hard to keep up just with the amount of fake information in text form that was coming out.

  • Now imagine, like all the fake information that can come out with fake audio and video.

  • Um, you know, yes, we can validate this stuff, but it takes time just like it does with the text.

  • So, you know, it's almost like a social denial of service that I feel like went on during the you know, the last election.

  • And it's only this kind of stuff is only gonna get worse.

  • I mean, just imagine floods of fake information in text, video and audio form.

  • Um, it's crazy.

  • It's crazy.

  • I think this is a really important field that people should be paying attention to.

  • So, um, the other thing I'd like to work on: besides doing fake video, like, I'd really like to create a text-to-video model.

  • So besides doing that, I'd also like to start working on adversarial networks to this.

  • So, just detecting fake audio and video. Um, I think pretty much, like, all social media websites and news agencies are going to need this, to start validating the stuff that's coming through.

  • Um, because I think it's gonna be a big deal.

  • It's already a big deal.

  • And we saw how big of a deal it was in 2016.

  • Um, I think it's only gonna get worse.

  • So anyways, on that super light note, uh, shout out to my most recent channel members, thank you guys very much for your support.

  • Uh, I'm gonna attempt to say all of these names.

  • I'm sorry if I mispronounced.

  • Feel free to correct me in the comments section.

  • Uh, Luke Guests on ready Casey, me hand.

  • Peter Freeman, Black Hawk, Pranic Kothari, Jesse Jones, Pain Max, Philip Wagner, Dylan Die and up.

  • Dom Main, Anders Nilsson.

  • I Adele J'ai Jason Robert not Brian Bradley and Carter Babin.

  • Thank you all very much.

  • All brand new members.

  • Welcome.

  • Welcome.

  • Hope you enjoy the content.

  • Um, okay.

  • So, questions, comments, whatever, feel free to leave them below. If you've got any suggestions on where to look, obviously I know about deepfakes, but I'm not actually trying to do deepfakes for the text-to-video stuff.

  • Um, what I'm quite honestly trying to do is completely generate things.

  • So there are, there is some research on talking heads, but I've never seen any code for it.

  • And the papers are still kind of vague about the model.

  • So, uh, I don't know if I'm gonna have to just try to figure that one out all on my own.

  • And by on my own.

  • I mean, me and Daniel.

  • So, um, it's gonna be really interesting.

  • But any suggestions there as well, feel free to leave them below, or come hang out with us in Discord: discord.gg/sentdex.

  • Otherwise, that's it.

  • Guys, see you in another video. Real or fake?
