So I wanted to make a video about GPT-2, because it's been in the news recently, this very powerful language model from OpenAI, and I thought it would make sense to start by just doing a video about transformers and language models in general, because GPT-2 is a very large language model implemented as a transformer.

But you have a previous video about generating YouTube comments, which is the same kind of task, right?

That's a language modeling task from natural language processing: using a language model to generate new samples of text. I believe that video was made October 2017, and this paper came out December 2017, which has kind of revolutionized the way that people carry out that kind of task.

That's not GPT-2, that's something before that, right?

That's the transformer, which is a relatively new architecture for neural networks that can do all kinds of tasks, but they're especially good at this kind of language modeling task.
A language model is a probability distribution over sequences of tokens, symbols or words or whatever, in a language. So for any given sequence of tokens, it can tell you how likely that is. If you have a good language model of English, it can look at a sequence of, you know, words or characters or whatever and say how likely that is to occur in English, how likely that is to be an English phrase or sentence.

And when you have that, you can use it for a lot of different tasks. If you want to generate text, you can just sample from that distribution and keep giving it its own output. So you sample a word, and, to be clear, sampling from a distribution means you're rolling the dice on that probability distribution and taking whichever one comes out. So you sample a word, and then you say, okay, conditioning on that, given that the first word of this sentence is 'The', what does the probability distribution look like for the second word? Then you sample from that distribution, and maybe it's 'cat', and you say, given that it's 'The cat', what's likely to come next, and so on. So you can build up a string of text by sampling from your distribution. That's one of the things you can use it for.
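Here is a minimal sketch of that sampling loop in Python. The toy conditional distributions are made up purely for illustration; a real language model would supply much better ones.

```python
import random

# Toy conditional distributions P(next word | previous word), made up for illustration.
next_word_probs = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "<end>": 0.2},
    "a":   {"cat": 0.4, "dog": 0.4, "<end>": 0.2},
    "cat": {"sat": 0.7, "<end>": 0.3},
    "dog": {"sat": 0.6, "<end>": 0.4},
    "sat": {"<end>": 1.0},
}

def sample_next(prev_word):
    """Roll the dice on the conditional distribution for the next word."""
    dist = next_word_probs[prev_word]
    return random.choices(list(dist), weights=list(dist.values()), k=1)[0]

def generate():
    """Build up a string of text by repeatedly sampling, conditioned on the last word."""
    word, sentence = "<s>", []
    while True:
        word = sample_next(word)
        if word == "<end>":
            return " ".join(sentence)
        sentence.append(word)

print(generate())  # e.g. "the cat sat"
```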
Most of us kind of have an example of this sort of thing in our pockets.

That's absolutely right, and that's the way that most people interact with a language model. I guess this is how I often start a sentence... apparently with 'I am not sure if you have any questions or concerns, please visit the plugin settings so I can do it for the first time in the future'. That's no good. Here's a different option, let's just see where this one goes, maybe the same: 'I am in the morning, but I can't find it on the phone screen from the phone screen on the phone screen on the phone screen on the phone screen on the phone screen.' I don't actually know how this is implemented.
It might be a neural network, but my guess is that it's some kind of Markov model, Markov chain type setup, where for each word in your language you look at your data set and you see how often each other word follows that word, and that's how you build your distribution. So for the word 'I', the most common word to follow it is 'am', and there are a few others, you know.
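A minimal sketch of that kind of count-based setup (this is a guess at the general approach, not how any particular keyboard actually works):

```python
from collections import Counter, defaultdict

# A tiny stand-in corpus; a real keyboard would be built from far more text.
corpus = "i am happy . i am here . i think so .".split()

# For each word, count how often each other word follows it.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

# The "top three" suggestions such a model might show after "i".
print(following["i"].most_common(3))  # [('am', 2), ('think', 1)]
```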
So this is a very simple model, and this sentence, 'on the phone screen on the phone screen on the phone screen on the phone screen on the phone screen', is actually very unlikely, right? It's a super low probability sentence; when would somebody ever type this? And the thing is, it's myopic. It's probably only looking at the previous word. It might be looking at the previous two words, but the problem is that looking further back becomes extremely computationally expensive, right? You've got, I don't know, 50,000 words that you might be looking at, so you're remembering 50,000 probability distributions, or 50,000 top-three lists. But then if you want to condition on two words, that's 50,000 squared, and if you want to go back three words you have to cube it. You're raising it to the power of the number of words back you want to go.
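Just to put rough numbers on that (the 50,000-word vocabulary is the figure from the discussion; the rest is simple arithmetic):

```python
vocab_size = 50_000

# Number of contexts you would need a separate distribution for,
# as a function of how many previous words you condition on.
for n_previous_words in (1, 2, 3):
    print(n_previous_words, vocab_size ** n_previous_words)
# 1 50000
# 2 2500000000
# 3 125000000000000
```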
Which means that this type of model basically doesn't look back. By the time we're saying 'on the', it's already forgotten the previous time it said 'on the'; it doesn't realize that it's repeating itself. There are slightly better things you can do in this general area, but fundamentally, if you can't remember the beginning of the sentence by the time you're at the end of it, you're not going to be able to make good sentences, right?
And so one of the big areas of progress in language models is handling long-term dependencies. I mean handling dependencies of any kind, but especially long-term dependencies. You've got a sentence that's like 'Shawn came to the hack space to record a video and I talked to blank', right? In that situation, if your model is good, you're expecting a pronoun, probably: 'he', 'she', 'they', 'them', whatever. But the relevant piece of information is the word 'Shawn', which is all the way at the beginning of the sentence. So your model needs to be able to say, oh, okay, 'Shawn', that's usually associated with male pronouns, so we'll put the male pronoun in there. And if your model doesn't have that ability to look back, or to just remember what it's just said, then you end up with these sentences that go nowhere. It might make a random guess at a pronoun and get it wrong, or it might just say 'and I talked to Frank', you know, just introduce a new name, because it's guessing at what's likely to come there and it's completely forgotten that Shawn was ever a thing. So yeah, these kinds of dependencies are a big issue for the things that you would want a language model to do.
But so far we've only talked about language models for generating text in this way. You can also use them for all kinds of different things. People use language models for translation, obviously: you have some input sequence that's in English and you want to output a sequence in French or something like that, and having a good language model is really important so that you end up with something that makes sense. Summarization is a task that people often want, where you read in a long piece of text and then you generate a short piece of text that's a summary of it; that's the kind of thing you would use a language model for. Or reading a piece of text and then answering questions about it, or writing a chatbot that's going to converse with people. Having a good language model is useful for basically almost all of natural language processing.

The other thing is, you can use it to enhance a lot of other language-related tasks. If you're doing speech recognition, there are a lot of things people can say that sound very similar, and to get the right one you need to be able to say: oh well, this word that sounds very similar would actually be incoherent in this sentence, it's very low probability; it's much more likely that they said this other thing, which would flow in the language.
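As a toy illustration of that rescoring idea (the candidate phrases and the pair scores are made up; a real recognizer would use a far better language model):

```python
import math

# Two acoustically similar candidates a speech recognizer might be unsure about.
candidates = ["recognize speech", "wreck a nice beach"]

# Made-up stand-in for a language model: log-probabilities of adjacent word pairs.
pair_logprob = {
    ("recognize", "speech"): math.log(0.01),
    ("wreck", "a"): math.log(0.002),
    ("a", "nice"): math.log(0.01),
    ("nice", "beach"): math.log(0.001),
}

def score(sentence):
    """Sum the log-probabilities of adjacent word pairs; higher means more plausible."""
    words = sentence.split()
    return sum(pair_logprob.get(pair, math.log(1e-6)) for pair in zip(words, words[1:]))

print(max(candidates, key=score))  # "recognize speech"
```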
And human beings do this all the time. Same thing with recognizing text from images, you know: you've got two words that look similar, or there's some ambiguity or whatever, and to resolve that you need an understanding of what word would make sense there, what word would fit.

If you're trying to use a neural network to do the kind of thing we were talking about before, having a phone autocomplete based on the previous word or two: suppose you've got a sequence of two words going in, you've got 'so' and then 'I', and you put both of these into your network, and it will then output, you know, 'said', for example, as a sensible next word. And then what you do is you throw away your 'so', you bring your 'said' around, and you make a new sequence, which is 'I said', and put that into your network, and it will put out 'to', for example, something that would make sense, and so on, and you keep going around. But the problem is this length is really short. You try to make it long enough to contain an entire sentence, just an ordinary-length sentence, and this problem starts to become really, really hard; networks have a hard time learning it and you don't get very good performance. And even then, you still have this absolute hard limit on how long a thing you can handle: you have to just pick a number that's like, how far back am I looking?
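A rough sketch of that fixed-window loop, with a hypothetical `predict_next` lookup standing in for a trained network (the two-word window and the example words come from the discussion; everything else is made up):

```python
def predict_next(window):
    """Stand-in for a trained network that maps a fixed-size window to a next word.
    A real system would run the window through a neural network instead."""
    lookup = {("so", "i"): "said", ("i", "said"): "to", ("said", "to"): "him"}
    return lookup.get(tuple(window), "<end>")

window = ["so", "i"]        # fixed-length input: only the last two words
generated = list(window)
while True:
    next_word = predict_next(window)
    if next_word == "<end>":
        break
    generated.append(next_word)
    # Throw away the oldest word and bring the new one around for the next step.
    window = [window[1], next_word]

print(" ".join(generated))  # so i said to him
```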
A better thing to do, you say, is a recurrent neural network, where you divide that up. So in this case you have a network, and you give it this vector, just a bunch of numbers, which is going to be the memory for that network. That's the idea: the problem is that it's forgotten the beginning of the sentence by the time it gets to the end, so we've got to give it some way of remembering. Rather than feeding it the entire sentence every time, you give it this vector, which you initialize, I guess, with zeros, and you give it just one word at a time of your input. I want to be clear, this is not something that I've studied in a huge amount of detail; I'm just giving the overall structure of the thing. But the point is, you give it this vector and the word, and it outputs its guess for the next word and also a modified version of that vector, which you then give it for the next step. So it spits out the next word of the sequence and its own modified version of the vector; every cycle that goes around, it's modifying this memory.

Once this system is trained very well, if you give it the first word, 'Shawn', then part of this vector is going to contain some information that's like, 'the subject of this sentence is the word Shawn', and some other part will probably keep track of something like 'we expect to use a male pronoun for this sentence', that kind of thing. So you take this and give it to that, and these are just two instances of the same network, and then it keeps going every time. So it spits out the next word, that word also comes around to here as input, then you put out the word after that, and so on.
But it's got this continuous thread of memory, effectively, going through, because it keeps passing the thing along. In principle, if it figures out something important at the beginning of, you know, the complete works of Shakespeare that it's generating, there's nothing strictly speaking stopping that from persisting, from being passed through from iteration to iteration to iteration every time. In practice it doesn't work that way, because in practice the whole thing is being messed with by the network on every step. In the training process it's going to learn that it performs best when it leaves most of it alone and doesn't just randomly change the whole thing, but by the time you're on the fiftieth word of your sentence, whatever the network decided to do on the first word is a photocopy of a photocopy of a photocopy of a photocopy, and so things have a tendency to fade out to nothing. It has to be successfully remembered at every step of this process, and if at any point it gets overwritten with something else, or the network did its best to remember it but it's actually only keeping 99% of it each time, well, 0.99 to the fifty is about 0.6, which is actually not that big a number. So these things work pretty well, but the performance still drops off really quickly once the sentences start to get long.
So this is a recurrent neural network, an RNN, because all of these boxes are really the same box; it's the same network at different time steps. It's really a loop like this.
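A minimal sketch of that loop, with made-up sizes and random, untrained weights, just to show the shape of the computation:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 10, 16   # made-up sizes for illustration

# Random, untrained weights; a real RNN would learn these.
W_xh = rng.normal(size=(hidden_size, vocab_size)) * 0.1
W_hh = rng.normal(size=(hidden_size, hidden_size)) * 0.1
W_hy = rng.normal(size=(vocab_size, hidden_size)) * 0.1

def step(word_id, h):
    """One time step: take a word and the memory vector, return a guess for the
    next word plus the modified memory vector that gets passed along."""
    x = np.zeros(vocab_size)
    x[word_id] = 1.0                    # one-hot encoding of the input word
    h = np.tanh(W_xh @ x + W_hh @ h)    # update the memory
    return int(np.argmax(W_hy @ h)), h

h = np.zeros(hidden_size)               # memory vector initialized with zeros
word = 3                                # arbitrary starting word id
for _ in range(5):
    word, h = step(word, h)             # same network, different time steps
    print(word, end=" ")
```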
You're giving the output of the network back as input every time. So this works better, and then people have tried all kinds of interesting things, like LSTMs; there are all kinds of variants on this general recurrent network idea.

The LSTM is the thing that might be in use there, isn't it?

Right, long short-term memory, which is kind of a surreal name. But yeah, the idea of that is that it's a lot more complicated inside these networks; there are actually kind of sub-networks that make specific decisions about gating things. So rather than having this system learn that it ought to pass most things on, it's more in the architecture that it passes most things on, and then part of the learning is deciding what to forget at each step, deciding what to change and what to put where, and so on. They perform better; they can hang on to the relevant information for longer.
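For a feel of the interface, here is a tiny sketch using PyTorch's built-in LSTM cell (the sizes and inputs are arbitrary and the weights untrained; this only shows the shape of the computation, not a working language model):

```python
import torch

input_size, hidden_size = 8, 16                # arbitrary sizes for illustration
cell = torch.nn.LSTMCell(input_size, hidden_size)

x = torch.randn(1, input_size)                 # one input word vector
h = torch.zeros(1, hidden_size)                # hidden state (the working memory)
c = torch.zeros(1, hidden_size)                # cell state that the gates protect

for _ in range(5):
    # Inside the cell, learned input/forget/output gates decide what to keep,
    # what to overwrite, and what to expose at each step.
    h, c = cell(x, (h, c))

print(h.shape)  # torch.Size([1, 16])
```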
But the other thing that people often build into these kinds of systems is something called attention, which is actually a pretty good metaphor. In the same way that you have networks deciding which parts of your hidden state to hang on to and which parts to forget, those kinds of gating decisions, you have a system which is deciding which parts of the input to pay attention to, which parts to use in the calculation and which parts to ignore. And this turns out to be actually very powerful. So there was this paper... when was this? 2017.

Yeah, so this is funny, because this came out the same year as the video you have about generating YouTube comments. This was in December; I think that video was October. Ancient history now.

Right, we're talking two years ago. The paper is called 'Attention Is All You Need'. They developed this system whereby it's actually a lot simpler as a network. You can see on the diagram here, if you compare this to the diagram for an LSTM or any of those kinds of variants, it's relatively simple, and it's just kind of using attention to do everything. So when I made that video, the LSTM-type stuff was state of the art, and that was until a couple of months later, I guess, when this paper came out. The idea is that attention is all you need: all of this stuff about having gates for forgetting things, all of that kind of stuff, in fact your whole recurrent architecture, you can do away with it and just use attention. Attention is powerful enough to do everything that you need.

At its base, attention is about actively deciding, in the same way that the LSTM is actively deciding what to forget and so on, which parts of some other part of the data it's going to take into account, which parts it's going to look at. It can be very dangerous in AI to use words for things that are words people already use for the way that humans do things; it makes it very easy to anthropomorphize and get confused, because the abstraction doesn't quite work. But I think attention is a pretty decent term, because it does make sense: it sort of draws the relationships between things. So you can have attention from the output to the input, which is what that would be, and you can also have attention from the output to other parts of the output.
So, for example, when I'm generating that sentence, 'Shawn came to record a video' or whatever, by the time I get to generating the word 'him', I don't need to be thinking about the entire sentence; I can just focus my attention on where I remember the name was. So the attention goes to 'Shawn', and then I can make the decision to use the word 'him' based on that. So rather than having to hang on to a huge amount of memory, you can just selectively look at the things that are actually relevant, and the system learns where to look, where to pay attention. And that's really cool.
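A bare-bones sketch of the dot-product attention computation at the heart of this, using random vectors instead of real learned word representations, just to show the mechanics:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, dim = 5, 8   # 5 words, 8-dimensional vectors (made-up sizes)

# In a real model these come from learned projections of the word representations.
queries = rng.normal(size=(seq_len, dim))
keys    = rng.normal(size=(seq_len, dim))
values  = rng.normal(size=(seq_len, dim))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Each position scores every position, turns the scores into weights,
# and takes a weighted average of the values: "where should I look?"
scores = queries @ keys.T / np.sqrt(dim)
weights = softmax(scores, axis=-1)   # each row sums to 1: the attention pattern
output = weights @ values

print(weights.round(2))              # which positions each word attends to
```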
There are attention-based systems for all kinds of things, not just text. Suppose your input is an image and you want to caption it. You can look at the moment it was outputting the sequence and say: when you generated the word 'dog', what were you looking at? You can get an attention heat map, and it will highlight the dog, because that's the part of the image it was paying attention to when it generated that output. It makes your system more interpretable, because you can see what it was thinking, and sometimes you can catch problems that way as well, which is kind of fun. Like, it generates an output that's 'a man is lifting a dumbbell' or something like that, and you look at it and it's not actually correct; it's a guy sitting there, I don't know, drinking some tea out of a mug. And what you find when you look at the outputs where it says 'dumbbell' is that the attention is mostly on the arms, because it's usually somebody muscular who's lifting the dumbbell in your photos. So that's overriding the fact that this thing kind of looks like a mug, because it was looking at the arms.
So the idea is that this system, which is called a transformer, is a type of neural network which relies very heavily on attention to produce state-of-the-art performance, and if you train them on a large corpus of natural language, they can learn to do very well; they can be very powerful language models.

So we had the example of a language model on your phone, which is very, very basic, then trying to do this with neural networks and the problems with remembering. So you have recurrent systems, which allow you to pass memory along so that you can remember the beginning of the sentence at least by the end of it, and things like LSTMs, all these different varieties where people try different things that are better at hanging on to memory, so they can handle longer-term dependencies, which gives you more coherent outputs and just generally better performance. And then the transformer is a variant on that, well, a different way of doing things, where you really focus on attention. And these are actually not recurrent, which is an important distinction to make: we don't have this thing of taking the output and feeding it back as the input every time. Because we have attention, we don't need to keep a big memory that we run through every time; when the system wants to know something, it can use its attention to look back to that part. It's not memorizing the text as it goes, it's paying attention to different bits of the text as it thinks they're relevant to the bit that it's looking at now.
And the thing about that is, when you have this recurrent thing, it's kind of inherently serial. Most of the calculations, you can't do them until you have the inputs, and the inputs are the output of the previous step. So you can't do the thing that people like to do now, which is run it on a million computers and get lightning-fast performance, because you have to go through the words in order, right? It's inherently serial. Whereas transformers are much more parallelizable, which means you get better computational performance out of them as well, which is another selling point. So they work better and they run faster; they're really a step up.

So transformers are this really powerful architecture. They seem to give really good performance on these sort of language modeling tasks, and what we didn't really know was how far you can push them, how good they can get. What happens if you take this architecture and you give it a bigger data set than any of them has ever been given, and more compute to train with, you know, a larger model with more parameters and more data? How good can these things get? How good a language model can you actually make? And that's what OpenAI was doing with GPT-2.
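If you want to try the released model yourself, something along these lines should work with the Hugging Face transformers library (this is just a convenient way to sample from GPT-2, not the setup OpenAI used):

```python
# pip install transformers torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "Shawn came to the hack space to record a video and I talked to"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# Sample a continuation token by token from the model's distribution.
output = model.generate(
    input_ids,
    max_length=40,
    do_sample=True,
    top_k=50,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0]))
```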