Are you surprised at the advances that have come in the last several years?

Oh, yes, definitely. I didn’t imagine it would become this impressive.

What’s strange to me is that we create these models, but we don’t really understand how the knowledge is encoded. To see what’s in there, it’s almost like a black box; although we see the innards, when it comes to understanding why it does so well, or so poorly, we’re still pretty naive.

One thing I’m really excited about is our lack of understanding of both types of intelligence, artificial and human. It really opens new intellectual problems. There’s something odd about how these large language models, which we often call LLMs, acquire knowledge in such an opaque way. They can perform some tasks extremely well, while surprising us with silly mistakes somewhere else.

It’s been interesting that, even when it makes mistakes, sometimes if you just change the prompt a little bit, then all of a sudden it gets it right, so even that boundary is somewhat fuzzy as people play around.

Totally. Quote-unquote "prompt engineering" became a bit of a black art, where some people say that you have to really motivate the transformers in the way that you motivate humans. One custom instruction that I found online was about how you first tell the LLM “you are brilliant at reasoning, you really think carefully,” and then somehow the performance is better, which is quite fascinating. But I find two very divisive reactions to the different results that you can get from prompt engineering. On one side, there are people who tend to focus primarily on the success case: so long as there is one answer that is correct, it means the transformers, or LLMs, do know the correct answer, and it’s your fault that you didn’t ask nicely enough. Then there is the other side, the people who tend to focus a lot more on the failure cases and conclude that nothing works. Both are extremes of a sort. The answer may be somewhere in between, but this does reveal surprising aspects of these systems.

Why? Why does it make these kinds of mistakes at all? We saw a dramatic improvement from models the size of GPT-3 going up to the size of ChatGPT-4. I thought of GPT-3 as kind of a funny toy, almost like a random sentence generator that I wrote 30 years ago. It was better than that, but I didn’t see it as that useful. I was shocked that ChatGPT-4, used in the right way, can be pretty powerful. If we go up in scale, say another factor of 10 or 20 above GPT-4, will that be a dramatic improvement, or a very modest improvement? I guess it’s pretty unclear.

Good question, Bill. I honestly don’t know what to think about it. There’s uncertainty, is what I’m trying to say. I feel there’s a high chance that we’ll be surprised again by an increase in capabilities, and then we will also be really surprised by some strange failure modes. More and more, I suspect that evaluation will become harder, because people tend to have a bias toward believing the success case. We do have cognitive biases in the way that we interact with these machines. They are more likely to be adapted to those familiar cases, but then, when you really start trusting it, it might betray you with unexpected failures. Interesting time, really.

One domain where it’s almost counterintuitive that it’s not as good is mathematics. You almost have to laugh that something like a simple Sudoku puzzle is one of the things that it can’t figure out, whereas even humans can do that.
Yes, it’s reasoning in general, which humans are capable of, that models like ChatGPT are not as reliable at right now. The reaction to that in the current scientific community is a bit divisive. On one hand, people might believe that with more scale the problems will all go away. Then there’s the other camp, who tend to believe that, wait a minute, there’s a fundamental limit to it, and there should be better, different ways of doing it that are much more efficient. I tend to believe the latter. Anything that requires symbolic reasoning can be a little bit brittle. Anything that requires factual knowledge can be brittle. It’s not a surprise when you actually look at the simple equation that we optimize for training these large language models, because, really, there’s no reason why such capability should suddenly pop out.

I wonder if the future architecture may have more of a self-understanding of reusing knowledge in a much richer way than just this forward-chaining set of multiplications.

Yes. Right now the transformers, like GPT-4, can look at such a large amount of context; they’re able to remember so many of the words spoken just now. Whereas humans, you and I, we both have a very small working memory. The moment we hear new sentences from each other, we kind of forget exactly what was said earlier, but we remember the abstract of it. We have this amazing capability of abstracting away instantaneously, and we have such a small working memory, whereas right now GPT-4 has an enormous working memory, so much bigger than ours. But I think that’s actually the bottleneck, in some sense, hurting the way that it’s learning, because it’s just relying on surface-level patterns, as opposed to trying to abstract away the true concepts underneath the text.

Subscribe to "Unconfuse Me" wherever you listen to podcasts.
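An illustrative aside, not part of the conversation: the "simple equation that we optimize for training these large language models" mentioned above is, in the standard pretraining setup, the next-token prediction objective, i.e. the cross-entropy loss of an autoregressive model over training text. A minimal sketch of that objective:

\[
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_1, \ldots, x_{t-1}\right)
\]

where \(x_1, \ldots, x_T\) are the tokens of a training sequence and \(p_\theta\) is the model’s predicted distribution over the next token. Nothing in this objective explicitly asks for symbolic reasoning or factual consistency, which is the point being made in the conversation.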
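A second aside, sketching the "motivate the model" style of prompt engineering described earlier ("you are brilliant at reasoning, you really think carefully"). The `chat()` helper and the exact message format below are assumptions standing in for whatever LLM client you actually use; only the idea of prepending a system-style instruction is the point.

```python
# Minimal sketch of "motivational" prompt engineering as described in the
# conversation. `chat()` is a hypothetical stand-in for a real LLM chat API.

from typing import List, Dict


def chat(messages: List[Dict[str, str]]) -> str:
    """Hypothetical helper: send chat messages to an LLM and return its reply."""
    raise NotImplementedError("wire this up to your own LLM client")


question = (
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than "
    "the ball. How much does the ball cost?"
)

# Plain prompt: just ask the question.
plain = [{"role": "user", "content": question}]

# "Motivated" prompt: prepend the kind of custom instruction mentioned above.
motivated = [
    {
        "role": "system",
        "content": "You are brilliant at reasoning. Think carefully, "
                   "step by step, before answering.",
    },
    {"role": "user", "content": question},
]

# Anecdotally (and, as discussed above, the evidence is contested), the second
# framing sometimes yields better answers on reasoning-style questions.
# print(chat(plain))
# print(chat(motivated))
```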
“There’s a high chance we’ll be surprised again by AI” | Unconfuse Me with Bill Gates (posted 2023/11/27)