So I wanted to make a video about GPT-2, because it's been in the news recently, this very powerful language model from OpenAI, and I thought it would make sense to start by just doing a video about transformers and language models in general, because GPT-2 is a very large language model implemented as a transformer.

But you have a previous video about generating YouTube comments, which is the same kind of task, right? That's a language modelling task.

[clip of generated sample text from the earlier video]

I believe that video was made October 2017, and this paper came out December 2017, which has kind of revolutionized the way that people carry out that kind of task.

That's not GPT-2, that's something before that, right?

That's the transformer, which is a relatively new architecture for neural networks that can do all kinds of tasks, but they're especially good at this kind of language modelling task. A language model is a probability distribution over sequences of tokens, or symbols, or words, or whatever, in a language. So for any given sequence of tokens, it can tell you how likely that is. So if you have a good language model of English, it can look at a sequence of, you know, words or characters or whatever and say how likely that is to occur in English, how likely it is to be an English phrase or sentence. And when you have that, you can use it for a lot of different tasks. So if you want to generate text, you can just sample from that distribution and keep giving it its own output. So you sample a word, and then you say...

And to be clear, sampling from a distribution means you're sort of rolling the dice on that probability distribution and taking whichever one comes out.

Right, so you can sample a word, and then say, okay, conditioning on that: given that the first word of this sentence is "The", what does the probability distribution look like for the second word? Then you sample from that distribution, and now it's, you know, "The cat", and you say, given that it's "The cat", what's likely to come next, and so on. So you can build up a string of text by sampling from your distribution. That's one of the things you can use it for.

Most of us kind of have an example of this sort of thing in our pockets.

Oh, absolutely right, and that's the way most people interact with a language model, I guess. This is how I often start a sentence, apparently, with "I": "I am not sure if you have any questions or concerns, please visit the plugin settings so I can do it for the first time in the future of". That's no good. Here's a different option, let's just see where this one goes, maybe the same: "I am in the morning but I can't find it on the phone screen from the phone screen on the phone screen on the phone screen on the phone screen on the phone screen."
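To make the sampling loop concrete, here's a minimal Python sketch of the idea being described: sample a word from a conditional distribution, then condition on that word and sample the next one. The words and probabilities are invented purely for illustration; a real model has a far larger vocabulary and a learned distribution.

```python
import random

# A toy "language model": for each context word, a probability distribution
# over possible next words. All numbers here are made up for illustration.
next_word_probs = {
    "<start>": {"the": 0.5, "i": 0.3, "a": 0.2},
    "the":     {"cat": 0.5, "dog": 0.5},
    "i":       {"am": 0.7, "said": 0.3},
    "cat":     {"sat": 0.6, "ran": 0.4},
    "dog":     {"sat": 0.5, "ran": 0.5},
}

def sample(dist):
    """Roll the dice on a probability distribution and return one outcome."""
    words, probs = zip(*dist.items())
    return random.choices(words, weights=probs, k=1)[0]

def generate(max_words=5):
    """Repeatedly sample a next word, conditioning on the word just produced."""
    word = "<start>"
    out = []
    for _ in range(max_words):
        dist = next_word_probs.get(word)
        if dist is None:   # no known continuation: stop
            break
        word = sample(dist)
        out.append(word)
    return " ".join(out)

print(generate())   # e.g. "the cat sat"
```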
I don't actually know how this is implemented. It might be a neural network, but my guess is that it's some kind of Markov model, Markov chain type setup, where for each word in your language you look at your data set and you see how often each other word follows that word, and that's how you build your distribution. So for the word "I", the most common word to follow that is "am", and there are a few others. So this is a very simple model, and this sentence, "on the phone screen on the phone screen on the phone screen on the phone screen on the phone screen", is actually very unlikely, right? This is a super low probability sentence; when would somebody type this?

And the thing is, it's myopic. It's probably only looking at the previous word; it might be looking at the previous two words. But the problem is that looking further back becomes extremely computationally expensive. You've got, I don't know, 50,000 words that you might be looking at, and so you're remembering 50,000 probability distributions, or 50,000 top-three words. But then if you want to look back two words, that's 50,000 squared, and if you want to go back three words you have to cube it. So you're raising it to the power of the number of words back you want to go, which means that this type of model basically doesn't look back. By the time it's saying "on the", it's already forgotten the previous time it said "on the"; it doesn't realize that it's repeating itself. There are slightly better things you can do in this general area, but fundamentally, if you can't remember the beginning of the sentence by the time you're at the end of it, you're not going to be able to make good sentences.

And so one of the big areas of progress in language models is handling long-term dependencies. I mean handling dependencies of any kind, but especially long-term dependencies. You've got a sentence that's like "Sean came to the hackspace to record a video and I talked to ___". In that situation, if your model is good, you're expecting a pronoun, probably, so it's "she", "they", "them", whatever, but the relevant piece of information is the word "Sean", which is all the way at the beginning of the sentence. So your model needs to be able to say, okay, "Sean", that's usually associated with male pronouns, so we'll put the male pronoun in there. And if your model doesn't have that ability to look back, or to just remember what it's just said, then you end up with these sentences that go nowhere. It might make a guess, just a random guess at a pronoun, and get it wrong, or it might just say "and I talked to" and then be like "Frank", you know, just introduce a new name, because it's guessing at what's likely to come there and it's completely forgotten that Sean was ever a thing.

So yeah, these kinds of dependencies are a big issue with things that you would want a language model to do. But so far we've only talked about language models for generating text in this way; you can also use them for all kinds of different things. So people use language models for translation, obviously: you have some input sequence that's in English and you want to output a sequence in French or something like that, and having a good language model is really important so that you end up with something that makes sense.
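Going back to the simple word-following model for a moment, here's a rough sketch of the counting it describes and the arithmetic behind why looking further back blows up. The bigram table is a guess at the general flavour of what a phone keyboard might do, not a claim about any particular implementation.

```python
from collections import Counter, defaultdict

corpus = "so i am happy . i am here . i am tired . i said hello".split()

# For each word, count how often each other word follows it.
follow_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follow_counts[prev][nxt] += 1

print(follow_counts["i"].most_common(1))   # [('am', 3)] in this toy corpus

# Why this can't simply look further back: with a 50,000-word vocabulary,
# conditioning on the last n words means on the order of 50,000**n contexts.
vocab = 50_000
for n in (1, 2, 3):
    print(f"context of {n} word(s): {vocab ** n:,} possible contexts")
```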
Summarization is a task people often want, where you read in a long piece of text and then you generate a short piece of text that's a summary of it; that's the kind of thing you would use a language model for. Or reading a piece of text and then answering questions about it, or if you want to write a chatbot that's going to converse with people, having a language model is good. Basically almost all of natural language processing, right, it's useful to have this.

The other thing is, you can use it to enhance a lot of other language-related tasks. So if you're doing speech recognition, there are a lot of things people can say that sound very similar, and to get the right one you need a good language model to be able to say, oh well, this word that sounds very similar would be incoherent in this sentence, it's very low probability; it's much more likely they said this other thing, which would actually flow in the language. And human beings do this all the time. Same thing with recognizing text from images: you've got two words that look similar, or there's some ambiguity or whatever, and to resolve that you need an understanding of what word would make sense there, what word would fit.

If you're trying to use a neural network to do the kind of thing we were talking about before, having a phone autocorrect based on the previous word or two: suppose you've got a sequence of two words going in, you've got "so" and then "I", and you put both of these into your network, and it will then output, you know, "said", for example, as a sensible next word. Then what you do is you throw away the "so", you bring the "said" around, and you make a new sequence, which is "I said", and put that into your network, and it will put out another word that would make sense, and so on, and you keep going around. But the problem is this length is really short. You try to make it long enough to contain an entire sentence, just an ordinary-length sentence, and this problem starts to become really, really hard. Networks have a hard time learning it and you don't get very good performance, and even then you still have this absolute hard limit: you have to just pick a number that's how far back you're looking.

A better thing to do is a recurrent neural network, where you divide that up. So in this case you have a network and you give it this vector, just a bunch of numbers, which is going to be like the memory for that network. The idea is, the problem is it's forgotten the beginning of the sentence by the time it gets to the end, so we've got to give it some way of remembering. Rather than feeding it the entire sentence every time, you give it this vector, which you initialize, I guess, with zeros, and you give it just one word at a time of your inputs. I want to be clear, this is not something that I've studied in a huge amount of detail; I'm just giving the overall structure of the thing. But the point is, you give it this vector and the word, and it outputs its guess for the next word and also a modified version of that vector, which is what you give it for the next step. So it spits out the word, or the sequence it's built so far, and its own modified version of the vector, every cycle that goes around.
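Here's a minimal, untrained sketch of that recurrent loop: one word goes in at a time together with the memory vector, and out comes a guess at the next word plus a modified memory vector. The sizes and weights below are arbitrary; a real RNN would learn the weight matrices from data.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 10, 16     # tiny sizes, just for the sketch

# Randomly initialised weights; training would learn these.
W_xh = rng.normal(scale=0.1, size=(hidden_size, vocab_size))   # input word -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # old memory -> new memory
W_hy = rng.normal(scale=0.1, size=(vocab_size, hidden_size))   # memory -> next-word scores

def one_hot(i):
    v = np.zeros(vocab_size)
    v[i] = 1.0
    return v

def rnn_step(word_id, h):
    """Take one word plus the memory vector; return next-word probabilities and the updated memory."""
    h = np.tanh(W_xh @ one_hot(word_id) + W_hh @ h)   # modify the memory
    scores = W_hy @ h
    probs = np.exp(scores) / np.exp(scores).sum()     # softmax over the vocabulary
    return probs, h

h = np.zeros(hidden_size)            # the memory starts as all zeros
for word_id in [3, 1, 4]:            # a toy input sequence of word ids
    probs, h = rnn_step(word_id, h)
    print(word_id, "->", int(probs.argmax()))   # the (untrained) network's guess at the next word
```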
It's modifying this memory every cycle. Once this system is trained very well, if you give it the first word, "Sean", then part of this vector is going to contain some information that's like "the subject of this sentence is Sean", and some other part will probably keep track of something like "we expect to use a male pronoun for this sentence", and that kind of thing. So you take this and give it to that, and these are just two instances of the same network, and it keeps going every time: it spits out the next word, that word and the modified vector come around to here, it puts out the word after that, and so on. It's got this continuous thread of memory effectively going through, because it keeps passing the thing along.

In principle, if it figures out something important at the beginning of, you know, the complete works of Shakespeare that it's generating, there's nothing strictly speaking stopping that from persisting, from being passed through from iteration to iteration to iteration every time. In practice it doesn't work that way, because in practice the whole thing is being messed with by the network on every step, and so in the training process it's going to learn that it performs best when it leaves most of it alone and doesn't just randomly change the whole thing. But by the time you're on the fiftieth word of your sentence, whatever the network decided to do on the first word is a photocopy of a photocopy of a photocopy of a photocopy, and so things have a tendency to fade out to nothing. It has to be successfully remembered at every step of this process, and at any point it might get overwritten with something else. Or even if it does its best to remember it but it's actually only remembering 99% of it each time, 0.99 to the 50 is actually not that big of a number. So these things work pretty well, but the performance still drops off really quickly once the sentences start to get long.

So this is a recurrent neural network, an RNN, because all of these boxes are really the same box; this is the same network at different time steps. It's really a loop: you're giving the output of the network back as input every time. So this works better, and then people have tried all kinds of interesting things, things like LSTMs. There are all kinds of variants on this general recurrent network idea. LSTM is the one that tends to get used, isn't it? Right, right, long short-term memory, which is kind of surreal. But the idea of that is, it's a lot more complicated inside these networks; there are actually kind of sub-networks that make specific decisions about gating things. So rather than having to have this system learn that it ought to pass most things on, it's more in the architecture that it passes most things on, and then part of the learning is deciding what to forget at each step, and deciding what to change, what to pass on, and so on. They perform better; they can hang on to the relevant information for longer.
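Two small illustrations of the points just made: the "photocopy of a photocopy" arithmetic, and the flavour of the gating idea. This is a heavily simplified sketch of a forget-gate-style update, not the actual LSTM equations.

```python
import numpy as np

# If the memory survives each step with only 99% fidelity, after 50 words
# a lot of it has faded:
print(round(0.99 ** 50, 3))   # ~0.605 -- only about 60% of the signal left
print(round(0.95 ** 50, 3))   # ~0.077 -- and it collapses fast if retention is a bit worse

# Gating, heavily simplified: a learned gate decides, per memory slot, how much
# of the old state to keep, rather than rewriting the whole vector every step.
def gated_update(old_state, candidate, forget_gate):
    # gate values near 1 keep the old content; values near 0 overwrite it
    return forget_gate * old_state + (1 - forget_gate) * candidate

old  = np.array([1.0, -0.5, 0.2])    # e.g. slot 0 holds "the subject is Sean"
new  = np.array([0.0,  0.9, -0.3])   # candidate new content for this step
gate = np.array([0.99, 0.1, 0.5])    # keep slot 0 almost untouched, mostly rewrite slot 1
print(gated_update(old, new, gate))
```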
But the other thing that people often build into these kinds of systems is something called attention, which is actually a pretty good metaphor. In the same way that you would have networks deciding which parts of your hidden state to hang on to or which parts to forget, those kinds of gating decisions, here you have a system which is deciding which parts of the input to pay attention to, which parts to use in the calculation and which parts to ignore. And this turns out to be actually very powerful.

So there was this paper. When was this? 2017. Yeah. So this is funny, because it came out the same year as the video you have about generating YouTube comments; this was December, and I think that video was October. Ancient history now. Right, we're talking two years ago. It's called "Attention Is All You Need". They developed this system which is actually a lot simpler as a network: you can see on the diagram here, if you compare this to the diagram for an LSTM or any of those kinds of variants, it's relatively simple, and it's just kind of using attention to do everything.

So when I made that video, the LSTM-type stuff was state of the art, and that was until a couple of months later, I guess, when this paper came out. The idea is that attention is all you need: this stuff about having gates for forgetting things, all of that kind of stuff, in fact your whole recurrent architecture, you can do away with it and just use attention. Attention is powerful enough to do everything that you need. At its base, attention is about actively deciding, in the same way that the LSTM is actively deciding what to forget and so on, which parts of some other part of the data it's going to take into account, which parts it's going to look at.
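The core operation in that paper is scaled dot-product attention: every query scores every key, the scores become weights via a softmax, and the output is a weighted average of the values. Here's a bare-bones NumPy sketch; in a real transformer the queries, keys and values come from learned projections of the token representations, and there are multiple attention heads, which this leaves out.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: weight the values by how well queries match keys."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # how strongly each position attends to each other position
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8                    # 5 tokens, 8-dimensional vectors
X = rng.normal(size=(seq_len, d_model))    # stand-in token representations

# Reusing X for Q, K and V keeps the sketch short; real models use learned projections.
out, weights = attention(X, X, X)
print(out.shape)                 # (5, 8): one updated vector per token
print(np.round(weights[4], 2))   # how much the last token attends to each of the 5 tokens
```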
It can be very dangerous in AI to use words for things that are words people already use for the way that humans do things; it makes it very easy to anthropomorphise and, you know, get confused, because the abstraction doesn't quite work. But I think attention is a pretty decent term, because it does make sense: it sort of draws the relationships between things. So you can have attention from the output to the input, which is what that would be, and you can also have attention from the output to other parts of the output. So for example, when I'm generating that sentence, "Sean came to record a video" or whatever, by the time I get to generating the word "him", I don't need to be thinking about the entire sentence. I can just focus my attention on where I remember the name was: the attention goes to "Sean", and then I can make the decision to use the word "him" based on that. So rather than having to hang on to a huge amount of memory, you can just selectively look at the things that are actually relevant, and the system learns where to look, where to pay attention to, and that's really cool.

There are attention-based systems for all kinds of things, not just text. Suppose your input is an image and you want to caption it. You can actually look at, when it was outputting the sequence, say when it generated the word "dog", what it was looking at: you can get an attention heat map and it will highlight the dog, because that's the part of the image it was paying attention to when it generated that output. It makes your system more interpretable, because you can see what it was thinking, and sometimes you can catch problems that way as well, which is kind of fun. Like, it generates an output that's "a man is lifting a dumbbell" or something like that, and you look at it and it's not actually correct; it's a man drinking some tea out of a mug, right? And what you find, when you look at your outputs where it says "dumbbell" and you look at the attention, is that the attention is mostly on the arms. It's usually somebody muscular who's lifting the dumbbell in your photos, and so that's overriding the fact that this thing kind of looks like a mug, because it was looking at the arms.

So the idea is, this system, which is called a transformer, is a type of neural network which just relies very heavily on attention to produce state-of-the-art performance, and if you train them on a large corpus of natural language, they can learn to do very well; they can be very powerful language models.

So we had the example of a language model on your phone, which is very, very basic; then trying to do this with neural networks and the problems with remembering; and so you have recurrent systems, which allow you to pass memory along so that you can remember the beginning of the sentence at least by the end of it; and then things like LSTMs, all these different varieties where people try different things that are better at hanging on to memory, so they can handle longer-term dependencies, which gives you more coherent outputs and just generally better performance. And then the transformer is a variant on that, well, a different way of doing things, where you really focus on attention. And these are actually not recurrent, which is an important distinction to make: we don't have this thing of taking the output and feeding it back in as the input every time, because we have attention, we don't need to keep a big memory that we run through every time.
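As a toy illustration of reading attention weights for interpretability, like the heat map described above: the matrix below is entirely invented, and in the captioning example the attention is over regions of an image rather than words, but the way you would read it is the same. For each generated word, you look at where the weight is concentrated.

```python
import numpy as np

input_words  = ["a", "man", "holding", "a", "mug", "of", "tea"]   # stand-ins for image regions
output_words = ["man", "lifting", "dumbbell"]                      # the (wrong) generated caption

# Invented attention weights: one row per generated word, one column per input,
# each row summing to 1. Real weights would come out of the attention layer.
weights = np.array([
    [0.05, 0.80, 0.05, 0.02, 0.03, 0.02, 0.03],
    [0.02, 0.30, 0.55, 0.03, 0.05, 0.02, 0.03],
    [0.02, 0.40, 0.25, 0.03, 0.20, 0.05, 0.05],
])

for out_word, row in zip(output_words, weights):
    attended = input_words[int(row.argmax())]
    print(f"{out_word!r} was generated while attending mostly to {attended!r}")
```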
When the system wants to know something, it can use its attention to look back to that part. It's not memorizing the text as it goes; it's paying attention to different bits of the text as it thinks they're relevant to the bit it's looking at now.

And the thing about that is, when you have this recurrent thing, it's kind of inherently serial. Most of the calculations, you can't do them until you have the inputs, and the inputs are the output of the previous step. So you can't do the thing that people like to do now, which is run it on a million computers and get lightning-fast performance, because you have to go through it in order, right? It's inherently serial, whereas transformers are much more parallelizable, which means you get better computational performance out of them as well, which is another selling point. So they work better and they run faster; they're really a step up.

So transformers are this really powerful architecture. They seem to give really good performance on these kinds of language modelling tasks. But what we didn't really know was how far you can push them, how good they can get. What happens if you take this architecture and give it a bigger data set than any of them has ever been given, and more compute to train with, you know, a larger model with more parameters and more data? How good can these things get? How good a language model can you actually make? And that's what OpenAI was doing with GPT-2.
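The video doesn't show any code for this, but if you want to poke at the idea yourself, one common way to sample from the released GPT-2 weights is the Hugging Face transformers library. This sketch assumes you have torch and transformers installed; "gpt2" here is the smallest released checkpoint, not the largest model discussed in the news.

```python
# pip install torch transformers
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "Sean came to the hackspace to record a video and I talked to"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# Sample a continuation from the model's distribution over next tokens.
output = model.generate(
    input_ids,
    max_length=40,
    do_sample=True,
    top_k=50,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```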