In the previous video we were talking about transformers, this architecture that uses attention to give unprecedentedly good performance on language modeling tasks, and some other tasks as well, but we were looking at language modeling. That was in preparation to make a video about GPT-2, which is this very giant language model that was recently... well, it was recently not released, actually, by OpenAI.
The way that they generated the data set for this is pretty cool. To get enough text, they went to Reddit and they pulled every website that is linked to from Reddit. Do we have any idea of how many that is? Lots. Literally everything. Well, everything that had more than three karma, I think, or maybe more than two karma, something like that. Anything that somebody had thought to post on Reddit, and that at least two or three people had thought was good enough to upvote: they scraped the text from that.
It's pretty much just a transformer. The architecture is not especially novel; they haven't made any amazing new discovery. But what they realized was that with transformers, it seems like the more data you give them the better they do, and the bigger you make them the better they do, and everything that we've built up until this point clearly hasn't hit the limits of what this can do. They thought: we're probably bottlenecked on data, and maybe network size. So what happens if we turn that up to 11? What happens if we just give this all the data and make a really big one?
It makes sense to talk about the acronym, right? So it's a generative pre-trained transformer. Generative, same as in generative adversarial network: it generates outputs, it generates samples. Pre-trained is this thing I was talking about: all of the different things you can use a language model for, right? You can do translation. You can try and resolve ambiguities. You can do summarization.
You can answer questions. You can use the probabilities for augmenting other systems. So there's a bunch of different benchmarks for these different tasks that you might want your language model to do. This is what we talked about in the gridworlds video: having these standardized problems with standardized metrics and standardized data sets, so that if you're comparing two different methods, you know that you're actually comparing apples to apples. And this is very important, because it gives you numbers on these things. That's often quite difficult, especially when you're generating samples of text and it's like, how plausible is this text? How realistic does it look? How do you put a number on that? It's kind of difficult. So there's all of these standardized metrics.
And the thing that people came to realize (I say that as though it's some amazing discovery; it's fairly obvious) is that if you train your system in an unsupervised way on a large corpus of just general English text, and then you take that and train it with the data from this benchmark or the data from that benchmark, you can fine-tune it. So you start with something which has a decent understanding of how English works, more or less, and then you say, now I'm going to give you these samples for question answering, or I'm going to build a system using that to go for this benchmark. So it's pre-trained: you start with something that's a general-purpose language model, and then from that you fine-tune it to whichever actual benchmark or problem you're trying to solve. And this can give you better performance than starting from nothing and training on each of the benchmarks from scratch. Makes sense?
And so the point of the GPT-2 paper, the thing that makes it cool, is they said: okay, if we make a really huge one, what if we
don't fine-tune it at all? What if we just make a giant model and then try and run it on the benchmarks without messing with it, without showing it any of the specialized data for that benchmark? Just the raw general-purpose language model: how does that perform? And it turns out, surprisingly well.
So this is a very, very large data set for text. It's about 40 gigabytes, which actually doesn't sound like very much, but for text that's insane, right? Somebody said that this was the size of Google's entire index of the Internet in '98. So yeah, it's a lot of text. They trained it on that and they ended up with a 1.5 billion parameter model. A previous state-of-the-art system was 345 million; this is 1.5 billion. So they've just made the thing much, much bigger, and it performs really well.
Some of the samples that they published quite captured the public imagination, you could say. And now that we've talked a little about the problems that neural networks, or any language model really, have with long-term dependencies, we can realise just how impressive these samples are. If you look at them uninitiated, you're like, yeah, that's pretty realistic, it seems to make sense and it's cool. But when you look at them knowing how language models work, it's very impressive: the coherence and the consistency and the long-range dependencies.
So we can look at the one that got everybody's attention, the unicorns one, right? They prompted it with: "In a shocking finding, scientists discovered a herd of unicorns living in a remote, previously unexplored valley in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English." And from there you go to your language model, GPT-2, and you say: given that we started with this, what's the next word, and what's the word after that, and so on?
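That "what's the next word, and the word after that" loop can be sketched in a few lines. This is only a toy illustration, not how GPT-2 is implemented: it swaps the transformer for a simple bigram count table, and the names `train_bigram`, `generate` and the tiny `corpus` string are made up for the example. What it does share with the real thing is the shape of the loop: pick a next word given what you have so far, append it, and repeat.

```python
import random
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for each word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts, prompt, max_new_words=10, seed=0):
    """Extend the prompt one word at a time, sampling each next word."""
    rng = random.Random(seed)
    words = prompt.split()
    for _ in range(max_new_words):
        # A bigram model only looks at the last word (an assumption of this
        # toy; a transformer attends over the whole preceding context).
        followers = counts.get(words[-1])
        if not followers:  # no known continuation: stop early
            break
        choices, weights = zip(*followers.items())
        words.append(rng.choices(choices, weights=weights, k=1)[0])
    return " ".join(words)


# Hypothetical miniature "data set", purely for illustration.
corpus = "the unicorns spoke perfect english and the scientists spoke to the unicorns"
model = train_bigram(corpus)
print(generate(model, "the scientists", max_new_words=5))
```

Because this toy only ever looks one word back, it loses the thread almost immediately, which is exactly the long-range dependency problem that makes the real samples impressive.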
So it goes on: "The scientist named the population, after their distinctive horn, Ovid's Unicorn. These four-horned, silver-white unicorns were previously unknown to science." We do have a clue here: as a human being, "unicorns" and "four-horned" doesn't quite make sense. But nonetheless, we're going, okay. "Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved. Dr. Jorge Pérez," that's Jorge, J-O-R-G-E, "an evolutionary biologist from the University of La Paz..."
This is impressive because we mentioned the Andes Mountains in our prompt, and so now it's saying, okay, this is clearly ("in a shocking finding") a science press release news article. It's seen enough of those, because it has every single one that was ever linked to from Reddit, right? So it knows how these go. It knows: okay, third paragraph, this is when we talk about the scientist, we interview the scientist. First word of the scientist paragraph: "Dr.", obviously, because now we're in the name of the scientist. What name are we going to give? It needs to be a name conditioned on the fact that we have the Andes Mountains, so we need to get that we're in South America. The name probably should be Spanish, or maybe Portuguese. So we get Dr. Pérez here. And then "evolutionary biologist" makes sense, because we're talking about animals. "From the University of La Paz": again, this is the first sentence, and when you have that first clause that introduces the scientist, you always say where they're from. So we say "from the University of", and then university names tend to be the name of a city. What's a city where we have the Andes Mountains? So we go to Bolivia: La Paz. Perfect.
And the thing that's cool about this is it's remembered all of these things that were quite a long time ago, several sentences ago. Well, it hasn't remembered them.
It's paid attention to them across that distance, which is impressive. But also, this is encoding a bunch of understanding, a bunch of information about the real world, right? All it was given, all it knows, is statistical relationships between words, but the way that it comes out to us is that it knows where the Andes Mountains are, what kind of names people in that area have, what their cities are, what the universities are, all of those facts about the real world. Because in order to have a really good language model, it turns out you have to kind of implicitly encode information about the world, because we use language to talk about the world, and knowing what's likely to come next requires actual real-world understanding. And that's something that we see in some of the other things that they got it to do: you can see the real-world understanding coming through.
Let's keep going. "...University of La Paz, and several companions, were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Pérez..." See, we're hanging on to him. Yep, we're referring to him again, but now we've changed it to be just the surname, because that's the format that people use in news articles. "Pérez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of rock and silver snow. Pérez and the others then ventured further into the valley." Around about here in our article we should have a quote from the scientist, right? Quote: "By the time we reached the top of one peak, the water looked blue, with some crystals on top." And we're talking about this fountain, I guess, this natural fountain. We're referring back to the previous...
It's like everything is relying on, contingent on, earlier parts of the text. "While examining..." (there's a snipped paragraph) "While examining these bizarre creatures, the scientists discovered that the creatures also spoke some fairly regular English." Now, when I read that I'm like, okay, this is now unusually good, because that's the second sentence of the lead, right? We're six paragraphs in, and it knows: at this point I've covered the first sentence of the initial paragraph, now it's time to talk about the second sentence of the lead, "even more surprising to the researchers was the fact that they spoke English." It completely ignored the speaking-English part until it got to the part of the news article where that comes in. You've gone six whole paragraphs. The idea of accurately remembering that the unicorns speak perfect English, that's very impressive to me.
And then it gets a little bit unhinged. It starts talking about how "it is likely that the only way of knowing for sure if unicorns are indeed the descendants of a lost alien race is through DNA." That's Reddit, really. Well, it's not actually stuff on Reddit, it's stuff linked to from Reddit. But yeah, this is news articles, man. "They seem to be able to communicate in English quite well, which I believe is a sign of evolution, or at least a change in social organization," said the scientist. That's his evolutionary biology there, right? Right, yeah, we know he's an evolutionary biologist. So the coherence of this text is really dependent on its ability to condition what it's generating on things that it generated a long time ago. So yeah, it can generate really nice news articles, and it can generate all kinds of text: anything that is sufficiently well represented in the original data set.
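As for what ends up "in the original data set": the selection rule described earlier (keep a page only if it was linked from a Reddit post that got at least a couple of karma) can be sketched roughly like this. It is a guess at the shape of the filter, not OpenAI's actual pipeline. The `Post` type, the `select_links` helper, the example URLs and the de-duplication step are all hypothetical, and the default threshold of 3 simply follows the "three karma, I think" recollection above.

```python
from dataclasses import dataclass


@dataclass
class Post:
    """A hypothetical record for one Reddit post that links out to a page."""
    url: str
    karma: int


def select_links(posts, min_karma=3):
    """Return outbound URLs from posts that meet the karma threshold.

    The threshold value and the de-duplication are assumptions made for
    this sketch; the order of first appearance is preserved.
    """
    seen = set()
    kept = []
    for post in posts:
        if post.karma >= min_karma and post.url not in seen:
            seen.add(post.url)
            kept.append(post.url)
    return kept


posts = [
    Post("https://example.com/a", 5),
    Post("https://example.com/b", 1),  # too little karma: dropped
    Post("https://example.com/a", 7),  # duplicate link: kept only once
]
print(select_links(posts))  # ['https://example.com/a']
```

The point of using karma is that it's a cheap human quality signal: somebody bothered to post the link, and a few other people thought it was worth an upvote.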
So that's GPT-2. It's a really unusually powerful and versatile language model that can do all of these different natural language processing tasks without actually being trained specifically on those tasks, and that's why it's impressive. It's not that it's a brand new architecture or a brand new approach or whatever. It's just that when you make these things really huge and give them tremendously large amounts of data, the results are really impressive.
So it will write you Lord of the Rings fan fiction, it will write you cake recipes. There's all kinds of examples of different samples. Here's a recipe for some kind of peppermint chocolate cake, and it's got a bunch of different