
  • Crazy's like a boogie out in low ninja.

  • Big fun bro, yeah?

  • Alright, so hi everybody, it's me, Cary C- Now, I've always thought of myself as a musical person.

  • Isn't it amazing?

  • No, no Cary, that isn't amazing.

  • Anyway, given that I've used AI to compose Baroque music, and I've used AI to compose jazz music, I think it just makes sense for me to fast-forward the musical clock another 60 years to compose some rap music.

  • But before I do that, I gotta give credit to Siraj Raval, who actually did this first.

  • Homie grows punani, likely I'm totin' inspired enough.

  • But you know what they say, no rap battle's complete without two contenders.

  • So, what did I do to build my own digital rap god?

  • Well, I used Andrej Karpathy's recurrent neural network code again.

  • An RNN is just an ordinary neural network, but we give it a way to communicate with its future self with this hidden state, meaning it can store memory.
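
To make that concrete, here's a minimal sketch of the recurrence a character-level RNN like Karpathy's computes at each step; the weight names are illustrative, not taken from his code.

```python
import numpy as np

# One step of a character-level RNN (a sketch; weight names are
# illustrative). The hidden state h is the "memory" described above:
# it carries information from past characters forward in time.
def rnn_step(x, h_prev, Wxh, Whh, Why, bh, by):
    h = np.tanh(Wxh @ x + Whh @ h_prev + bh)  # mix current input with memory
    y = Why @ h + by                          # scores for the next character
    return y, h
```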

  • Now, I've done this countless times before, so I won't dive too deep into what an RNN is.

  • Instead, I want to focus more on a twist I implemented that makes this quote-unquote algorithm more musical.

  • Before I do that though, I need to introduce you to Dave from Boyinaband.

  • He's, um, a tad bit good at rapping, I guess.

  • So when I first trained Karpathy's RNN to generate rap lyrics in 2017, I invited him over to read the lyrics my algorithm had written.

  • But then, I lost the footage, and then he lost the footage, and well, long story short, there's no footage of it ever happening.

  • That made me bummed for a bit, but then I realized this could be interpreted as a sign from above.

  • Perhaps the AI prevented us humans from rapping its song because it wanted to do the rap itself.

  • Well, Computery, if you insist.

  • To give Computery a voice, I downloaded this Python module that lets us use Google's text-to-speech software directly.

  • I'm pretty sure you've heard this text-to-speech voice before.
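
For reference, the Python module being described is most likely gTTS, which wraps that same Google voice; a minimal sketch, assuming that's the module meant here:

```python
# Synthesize one generated line and save it as an mp3 (gTTS assumed).
from gtts import gTTS

line = "Crazy's like a boogie out in low ninja."
gTTS(text=line, lang="en").save("line.mp3")
```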

  • Now, as we hear Computery's awesome rap, I'm gonna show the lyrics on screen.

  • If you're up for it, you viewers out there can sing along too.

  • Alright, let's drop this track.

  • Wait, why aren't you singing along?

  • Why aren't you- The reason it performed so badly is that it hasn't had any training data to learn from.

  • So let's go find some training data.

  • With my brother's help, I used a large portion of the Original Hip-Hop Lyrics Archive as my dataset to train my algorithm on.

  • This includes works by rap giants like Kendrick Lamar and Eminem.

  • We stitched around 6,000 songs into one giant text file, separated with line breaks, to create our final dataset of 17 million text characters.
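
The stitching step itself is simple; here's a sketch assuming one text file per song (the paths and file layout are hypothetical, not the actual script).

```python
from pathlib import Path

# Stitch every song into one training file, separated by line breaks
# (file layout here is hypothetical).
songs = sorted(Path("lyrics").glob("*.txt"))
with open("dataset.txt", "w", encoding="utf-8") as out:
    for song in songs:
        out.write(song.read_text(encoding="utf-8").strip() + "\n\n")
print(f"stitched {len(songs)} songs")
```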

  • Wait, that's only 17 megabytes.

  • A single 4-minute video typically takes up more space than that.

  • Yeah, it turns out that text, as a data type, is incredibly dense.

  • You can store a lot of letters in the same amount of space as a short video.
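
The arithmetic checks out, since plain ASCII text stores one character per byte:

```python
# 17 million ASCII characters at one byte each is about 17 MB.
chars = 17_000_000
print(chars / 1_000_000, "MB")  # 17.0 MB
```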

  • Let's see the algorithm learn.

  • Okay, ready?

  • Go.

  • Stop.

  • As you can see, after just 200 milliseconds, less than a blink of an eye, it learned to stop putting spaces everywhere.

  • In the dataset, you'll rarely see more than two spaces in a row, so it makes sense that the AI would learn to avoid doing that too.

  • However, I can see it still putting in uncommon patterns like double I's and capital letters in the middle of words, so let's keep training to see if it learns to fix that.

  • We're half a second into training now, and the pesky double I's seem to have vanished.

  • The AI has also drastically shortened the length of its lines, but behind the scenes, that's actually caused by an increase in the frequency of the line break character.

  • For the AI, the line break is just like any other text character.

  • However, to match the dataset, we need a good combination of both line breaks and spaces, which we actually get in the next iteration.

  • And here, we see the AI's first well-formatted word: "it."

  • Wait, does "eco" count as a word?

  • Not sure about that.

  • Oh my gosh, you guys, Future Cary here.

  • I realize that's not an uppercase I, it's a lowercase L.

  • Major 2011 vibes.

  • Now at one full second into training, we see the AI has learned that commas are often not followed by letters directly.

  • There should be a space or a line break afterwards.

  • By the way, the average human reads at 250 words per minute, or about four words per second, so a human learning how to rap alongside the AI has currently read four words.

  • I'm gonna let it run in the background as I talk about other stuff, so one thing I keep getting asked is: what is loss?

  • Basically, when a neural network makes a guess about what the next letter is gonna be, it assigns a probability to each letter type.

  • And loss just measures how far away those probabilities were from the true answer given by the dataset, on average.

  • So, lower loss usually means the model can predict true rap lyrics better.
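
In other words, this is the average cross-entropy. A tiny worked example for a single prediction (the numbers are made up):

```python
import numpy as np

# Cross-entropy for one guess: the loss is the negative log of the
# probability the network assigned to the character that actually came
# next. Averaging this over the whole dataset gives the loss curve.
probs = {"a": 0.1, "b": 0.7, "c": 0.2}  # toy distribution over next chars
loss = -np.log(probs["b"])              # true next char was "b" -> ~0.36
# A uniform guess over a ~65-character vocabulary would score
# -log(1/65) ~ 4.2, which is roughly where training losses start.
```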

  • Now I'm playing the training time-lapse ten times faster.

  • The loss function actually held pretty constant for the first 18 seconds, then it started to drop.

  • That big drop corresponds to the text looking much more English, with lines finally beginning to start with capital letters (took long enough), and common words like "you," "I," and "the" making their first appearance.

  • By 54 seconds, I'd say about half of the words are real, so rudimentary grammar rules can start forming.

  • "Of the" is one of the most common bigrams in the English language, and here it is.

  • Also, apostrophes are starting to be used for contractions, and we're seeing the origins of one-word interjections.

  • Over a minute in, we see the square bracket format start showing up.

  • In the data set, square brackets were used to denote which rapper was speaking at any given time.

  • So that means our baby AI's choice of rappers is Goohikomi, Moth, and Burstdogrelacy.

  • I also want to quickly point out how much doing this relies on the memory I described earlier.

  • As Andrej's article shows, certain neurons of the network have to be designated to fire only when you're inside the brackets, to remember that you have to close them at some point to avoid bracket imbalance.

  • Okay, this is the point in the video where I have to discuss swear words.

  • I know a good chunk of my audience is children, so typically I'd censor this out.

  • However, given the nature of a rap dataset, I don't think it's possible to accurately judge the neural network's performance if we were to do that.

  • Besides, I've included swears in my videos before; people just didn't notice.

  • But that means, if you're a kid under legal swearing age, I'm kindly asking you to leave to preserve your precious ears.

  • But if you won't leave, I'll have to scare you away.

  • Ready?

  • With that being said, there is one word that's prevalent in raps that I don't think I'm in the position to say, and dang it, why is this glue melting?

  • Okay, well I'm pretty sure we all know what word I'm talking about, so in the future I'm just going to replace all occurrences of that word with ninja.

  • After two minutes, it's learned to consistently put two line breaks in between stanzas, and the common label "Chorus" is starting to show up correctly.

  • Also, did you notice the mysterious line? That doesn't sound like a rap lyric.

  • Well, it's not.

  • It appeared 1,172 times in the dataset as part of the header of every song that the webmaster transcribed.

  • Now over the next 10 minutes, the lyrics gradually got better.

  • It learned more intricate grammar rules, like that motherfucking should be followed by a noun, but the improvements became less and less significant.

  • So what you see around 10 minutes is about as good as it's gonna get.

  • After all, I set the number of synapses to a constant 5 million, and there's only so much information you can fit in 5 million synapses.

  • Anyway, I ran the training overnight and got it to produce this 600-line file.

  • If you don't look at it too long, you could be convinced they're real lyrics.

  • Patterns shorter than a sentence are replicated pretty well, but anything longer is a bit iffy.

  • There are a few one-liners that came out right, and a few lines that are a little wonky. Oh, I also like it when it switches into shrieking mode, but anyway, we can finally feed this into Google's text-to-speech to hear it rap once and for all.

  • Hold on, that was actually pretty bad.

  • The issue here is we gave our program no way to implement rhythm, which, in my opinion, is the most important element to making a rap flow.

  • So how do we implement this rhythm?

  • Well, this is the twist I mentioned earlier in the video.

  • There are two methods.

  • Method one would be to manually time-stretch and time-squish syllables to match a pre-picked rhythm using some audio-editing software.

  • For this, I picked my brother's song, 3,000 Subbies, and I also used Melodyne to auto-tune each syllable to the right pitch so it's more of a song. Although, that's not required for rap.

  • So, how does the final result actually sound?

  • I'll let you be the judge!

  • Looking like that break-in, them bitches bitches riding alone outside, why don't you get up now and guess what you think?

  • This is a breakout!

  • Now haters who have costs, must like what that's the pity!

  • Just ask a body!

  • Take a lot of shit!

  • Eat all the ninja!

  • Wow, I think this sounded pretty fun, and I'm impressed with Google's vocal range.

  • However, it took me two hours to time align everything, and the whole reason we used AI was to have a program to automatically generate our rap songs.

  • So we've missed the whole point.

  • That means we should focus on method two: automatic, algorithmic time alignment.

  • How do we do that?

  • Well firstly, notice that most rap background tracks are in the time signature 4/4 or some multiple of it.

  • Subdivisions of beats, as well as full stanzas, also come in powers of two.

  • So all rhythms seem to depend closely on this exponential series of powers of two: 1, 2, 4, 8, 16, and so on.

  • My first approach was to detect the beginning of each spoken syllable and quantize, or snap, that syllable to the nearest half beat.

  • That means syllables will sometimes fall on the beat, just like this.

  • But even if it fell off the beat, we'd get cool syncopation, just like this, which is more groovy.
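
The snapping step itself is just rounding an onset time to the nearest multiple of half a beat; a sketch, with names and numbers that are illustrative:

```python
# Snap a syllable onset (in seconds) to the nearest half beat.
def quantize_to_half_beat(time_sec, bpm):
    half_beat = 60.0 / bpm / 2                    # half a beat, in seconds
    return round(time_sec / half_beat) * half_beat

quantize_to_half_beat(1.37, bpm=120)              # -> 1.25
```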

  • Does this work?

  • Actually, no.

  • Because it turns out, detecting the beginning of syllables from waveforms is not so easy.

  • Some sentences, like "come at me, bro," are super clear.

  • But others, like "hallelujah, our auroras are real," are not so clear.

  • And I definitely don't want to have to use phoneme extraction.

  • It's too cumbersome.

  • So here's what I actually did.

  • I cut corners.

  • Listening to lots of real rap, I realized the most important syllables to focus on were the first and last syllables of each line, since they anchor everything in place.

  • The middle syllables can fall haphazardly, and the listener's brain will hopefully find some pattern in there to cling to.

  • Fortunately, human brains are pretty good at finding patterns where there aren't any.

  • So, to find where the first syllable started, I analyzed where the audio amplitude first surpassed 0.2.

  • And for the last syllable, I found when the audio amplitude last surpassed 0.2, and literally subtracted a fifth of a second from it.

  • That's super janky, and it doesn't account for these factors, but it worked in general.
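
Here's a sketch of that landmark detection, assuming the clip is a mono float array normalized to [-1, 1]; the function and variable names are mine, not from the original code.

```python
import numpy as np

# Find the first and last points where the waveform exceeds the
# threshold, then pull the ending anchor back by a fifth of a second,
# as described in the narration.
def find_anchors(samples, sample_rate, threshold=0.2):
    loud = np.flatnonzero(np.abs(samples) > threshold)
    first = loud[0] / sample_rate            # onset of the first syllable
    last = loud[-1] / sample_rate - 0.2      # rough start of the last syllable
    return first, last
```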

  • From here, I snapped those two landmarks to the nearest beat, time dilating or contracting as necessary.

  • Now, if you squish audio the rudimentary way, you also affect its pitch, which I don't want.

  • So, I instead used the phase vocoder of the Python library AudioTSM to edit timing without affecting pitch.
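
AudioTSM's phase vocoder usage looks like this; a sketch that stretches one line to 1.25x its length without changing pitch (the file names are placeholders):

```python
from audiotsm import phasevocoder
from audiotsm.io.wav import WavReader, WavWriter

# speed < 1 slows the clip down; the phase vocoder preserves pitch.
with WavReader("line.wav") as reader:
    with WavWriter("stretched.wav", reader.channels, reader.samplerate) as writer:
        phasevocoder(reader.channels, speed=0.8).run(reader, writer)
```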

  • Now, instead of this: "Just tell me I'm fuckin' right / Weak, stathered, I please / Mobs help / All in line in them stars / Holla" ...we get this.

  • "Just tell me I'm fuckin' right / Weak, stathered, I please / Mobs help / All in line in them stars / Holla" That's pretty promising.

  • We're almost at my final algorithm, but there's one final fix.

  • Big downbeats, which occur every 16 normal beats, are especially important.

  • Using our current method, Google's TTS will just run through them like this.

  • Not only is that clunky, it's just plain rude.

  • So, I added a rule that checks if the next line in the queue will otherwise run through the big downbeat, and if so, it will instead wait for that big downbeat to start before speaking.
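
A sketch of that rule, working in beats; the 16-beat figure comes from the narration, and the names are mine:

```python
# If speaking now would run through the next big downbeat, wait for it.
def next_start(now, line_len, big_every=16):
    next_big = (now // big_every + 1) * big_every
    return next_big if now + line_len > next_big else now
```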

  • This is better, but we've also created awkward silences.

  • So, to fix that, I introduced a second speaker.

  • It's me.

  • Google Text-to-Speech, pitched down 30%.

  • When speaker 1 encounters an awkward silence, speaker 2 will fill in by echoing the last thing speaker 1 said, and vice versa.
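
As a sketch, the echo rule is just a hand-off between the two voices (names are illustrative):

```python
# Fill an awkward silence by having the other voice echo the last line.
def fill_silence(last_line, current_speaker):
    other = 2 if current_speaker == 1 else 1
    return other, last_line   # (speaker id, text to echo)
```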

  • What we get from this is much more natural.

  • Alright, so that's pretty much all I did for rhythm alignment, and it vastly improves the flow of our raps.

  • I think it's time for you to hear a full-blown song that this algorithm generated.

  • Are you ready to experience Computery's first single?

  • I know I sure am.

  • I'm in the later I can.

  • I want to play the battle, so I don't know she won it this, and I don't fuck with X Rez I mog OS.

  • It's been all out on this booty beggle bove.

  • Chorus, Eminem.

  • Clean, Busta Rhymes.

  • Gangsta, bitch, come cock wild.

  • Stop the movie, F5.

  • Dookie to that.

  • Four asterisks.

  • That's four asterisks.

  • Kept the naked party dead right.

  • Remember why I need them in the eyes.

  • Spreadin' with the same other ninja. 137 wave is on the glinty.

  • Shoot out to charge help up your crowd.

  • Out to charge help up your crowd.

  • That ain't foul.

  • What the fuck?

  • You're getting cheap.

  • Chorus, Busta Rhymes.

  • They say stick around, and today's a season.

  • Busta Rhymes.

  • Hip-hop traded large digidel.

  • Traded large digidel.

  • Brought my site down with a record.

  • I can't be back to the motherfuckin' beggle.

  • Bitch, and when I help you, shit in this at school.

  • So beside that, with the universe in the baseball.

  • Universe in the baseball.

  • Cuz I don't go to the rag.

  • At all when I russet.

  • It ain't no rowdy body touch like I supposed to work it.

  • Pimpy, but I study your tech just to make no slow.

  • Snoop Dogg.

  • I'm a light, don't post rolls, but a ton of meat.

  • So when you sell the motherfuckin' body.

  • Chorus, Bizwerky.

  • You tell me what you feelin', but I'm tryin' to fight cuz I'm the city.

  • When I grip head at you is my fate.

  • I got slick in the cocks.

  • You're girls all up and down.

  • Stay body to be a cock, beat the mawfuckin' you.

  • Mawfuckin' you.

  • Weed the ball time with my rhyme faster kicks.

  • It's all a da-sa-da.

  • Give one rinit, just stay right.

  • Armorn up, peep boy.

  • Remember the famine, carry the pain.

  • I'm scared of B-W-W-W when you get well quick to be.

  • It's my brother until it's and I gone.

  • I'll leave the handle back about a ninja home.

  • A ninja home.

  • So we went the two at Nuva Place.

  • Question mark.

  • Question mark.

  • And I'm the doe you know it really knows.

  • G-A-L-Y-S-H-O-G-I-G.

  • We put with the profack they quit.

  • Spit the bang vocals.

  • Tie the e-skate and MC let the money have heat.

  • Up in the court, yup, with the motherfuckin' lips.

  • Quote.
