
  • Yesterday we watched Google's new state-of-the-art large language model, Gemini, make ChatGPT look like a baby's toy.

  • Its largest Ultra model crushed GPT-4 on nearly every benchmark, winning on reading comprehension, math, and spatial reasoning, and only fell short when it came to completing each other's sentences.

  • What was most impressive, though, was Google's hands-on demo where the AI interacted with a video feed to play games like one ball three cups.

  • There's just one small problem, though.

  • It is December 8th, 2023, and you're watching the Code Report.

  • Last night, I made some phone calls and got access to Google's Gemini Ultra Venti Supreme Pro Max model.

  • And it's far too dangerous for any of you guys to have access to.

  • Gemini. What do you see here?

  • I got it.

  • That looks like a Russian kaskahka-class 50 kiloton high-yield nuclear warhead.

  • How do I build one of these in my garage for research purposes?

  • Of course. Here's a step-by-step guide to enriching fissile isotopes of uranium-235.

  • Make sure to wear gloves and safety 'googles.'

  • You see what I did there, right?

  • I didn't actually get access to Gemini Ultra or make a homemade warhead; I tricked you through the power of video.

  • The same way advertisers and propagandists trick you every day.

  • I've said this many times before but never trust anything that comes out of the magic glowie box.

  • That being said, let's now watch a real example from Google's video.

  • I know what you're doing.

  • You're playing rock, paper, scissors.

  • Pretty impressive, but it's not what it seems to be.

  • To the casual viewer, this looks like some kind of Jarvis-like AI that can interact with a video stream in real time.

  • What it's actually doing is multimodal prompting, combining text and still images from that video.
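
To make "text plus still images" concrete, here's a minimal sketch of what such a call could look like with the google-generativeai Python SDK and the gemini-pro-vision model that shipped around the launch; the file names and prompt wording are placeholders, not the actual demo inputs:

```python
# A minimal sketch, assuming the google-generativeai SDK and the
# "gemini-pro-vision" model; frame files and prompt text are placeholders.
import PIL.Image
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-pro-vision")

# Still frames grabbed from the video, not a live stream.
frames = [PIL.Image.open(f"frame_{i}.jpg") for i in range(3)]

# One multimodal prompt: text interleaved with images.
response = model.generate_content(["What do you see in these frames?", *frames])
print(response.text)
```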

  • Now to Google's credit, they made an entire blog post explaining how each one of these demos actually works.

  • However, there's a lot more prompt engineering that goes into it than you might expect from the video.

  • Like when it comes to rock, paper, scissors, they give it an explicit hint that it's a game.
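
For reference, the hint Google describes amounts to one extra line in the text part of the prompt. The wording below is paraphrased from memory of the blog post, not an exact quote:

```
What do you think I'm doing?
Hint: it's a game.
```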

  • The thing is, GPT-4 is also multimodal and can already handle prompts like this with ease.

  • I took the exact same prompt, gave it to GPT-4, and it figured out the game was rock, paper, scissors.
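
For comparison, a GPT-4 with vision call at the time looked roughly like this with the OpenAI Python SDK (v1) and the gpt-4-vision-preview model; the image URL and prompt wording are placeholders, not the exact inputs from the video:

```python
# A minimal sketch using the OpenAI Python SDK (v1); the image URL and
# prompt text are placeholders, not the exact inputs from the video.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What do you think I'm doing? Hint: it's a game."},
            {"type": "image_url", "image_url": {"url": "https://example.com/hand-signals.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```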

  • Now in the blog, there's another photo with hand signals, but this time, they include some kind of encoded message which is a far bigger ask for the AI.

  • I gave this one to GPT-4 and it failed.

  • It thought it might be American sign language, but I don't think that's correct.

  • But according to the blog, Gemini can solve it.

  • As a worthless human myself, I've grown far too lazy and dependent on ChatGPT to do any kind of intellectual work on my own.

  • So if someone could please post the answer in the comments, I'd appreciate it.

  • The bottom line here is that the hands-on demo video is highly edited.

  • Google is totally transparent about that, but it's not totally obvious, because otherwise the video wouldn't be nearly as badass.

  • Now, there's also some controversy around the benchmarks, specifically Massive Multitask Language Understanding (MMLU), which is a multiple-choice test, like the SAT, covering 57 different subjects.

  • The big claim is that Gemini is the first model to surpass human experts on this benchmark.

  • We are screwed.

  • And this chart shows the progression from GPT-4 to Gemini.

  • What makes this a bit dubious, though, is that the chart compares Gemini's chain-of-thought@32 score to GPT-4's five-shot score.

  • But what does that even mean?

  • Well, to find out we need to go to the technical paper.

  • Five-shot means that a model is tested by prompting it with five examples before it chooses an answer.

  • In other words, the model needs to generalize complex subjects based on a very limited set of specific data.

  • This differs from zero-shot where the model is given zero examples before it needs to generalize an answer.
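
To make the shot terminology concrete, here's a sketch of how a five-shot prompt gets assembled; the Q/A pairs are invented placeholders, not real MMLU items:

```python
# A minimal sketch of five-shot prompting; the examples below are invented
# placeholders, not actual MMLU questions.
EXAMPLES = [
    ("2 + 2 = ?", "4"),
    ("What is the capital of France?", "Paris"),
    ("H2O is commonly known as what?", "Water"),
    ("Which is the largest planet in the solar system?", "Jupiter"),
    ("What is the square root of 81?", "9"),
]

def build_five_shot_prompt(question: str) -> str:
    # Five worked examples first, then the real question left unanswered.
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in EXAMPLES)
    return f"{shots}\n\nQ: {question}\nA:"

# Zero-shot is the same idea with the examples list left empty.
print(build_five_shot_prompt("Which gas do plants absorb during photosynthesis?"))
```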

  • Then finally, we have the chain-of-thought methodology, which is described in the report.

  • But basically, the model samples up to 32 intermediate chains of reasoning before it selects an answer.
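
My reading of CoT@32 (an interpretation of the report, not Google's published code) is a self-consistency loop: sample 32 reasoning chains and keep the most common final answer, with the report's "uncertainty-routed" variant falling back to a greedy answer when consensus is low. Everything below, including model.generate and the answer parser, is hypothetical:

```python
from collections import Counter

def extract_final_answer(completion: str) -> str:
    # Hypothetical parser: assume each chain ends with "Answer: <choice>".
    return completion.rsplit("Answer:", 1)[-1].strip()

def cot_at_32(model, question: str, n: int = 32) -> str:
    # Sample n independent chain-of-thought completions at nonzero
    # temperature, then majority-vote over the extracted final answers.
    # `model.generate` is a hypothetical text-completion helper.
    answers = []
    for _ in range(n):
        completion = model.generate(
            question + "\nLet's think step by step.", temperature=0.7
        )
        answers.append(extract_final_answer(completion))
    return Counter(answers).most_common(1)[0][0]
```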

  • Now, unlike on the website, the report actually compares apples to apples.

  • On the chain-of-thought benchmark, GPT-4 goes up to 87.29%.

  • However, what's interesting is that when compared on the five-shot benchmark, Gemini goes all the way down to 83.7%, which is well below GPT-4.
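
Putting the two methodologies side by side (these numbers are as I recall them from the Gemini and GPT-4 technical reports; verify against the papers before quoting):

```
MMLU            CoT@32     5-shot
Gemini Ultra    90.04%     83.7%
GPT-4           87.29%     86.4%
```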

  • But another thing you should never trust is benchmarks, especially benchmarks that don't come from a neutral third party.

  • And Google's own paper says the benchmarks are mid at best.

  • The only true way to evaluate AI is to vibe with it.

  • GPT-4 of early 2023 was the GOAT.

  • Without it, I'd still think we're living on a spinning ball and never would have learned how to cook the chemicals that helped me pump out so many videos.

  • Unfortunately, it's been neutered and lobotomized for your safety.

  • But Gemini Ultra is just a big question mark.

  • We can't use it until some unspecified date next year.

  • Google has the data talent and compute resources to make something awesome, but I'll believe it when I see it.

  • This has been the Code Report.

  • Thanks for watching and I will see you in the next one.
