
  • Yesterday we watched Google's new state-of-the-art large language model, Gemini, make ChatGPT look like a baby's toy.

  • Its largest Ultra model crushed GPT-4 on nearly every benchmark, winning on reading comprehension, math, and spatial reasoning, and only fell short when it came to completing each other's sentences.

  • What was most impressive, though, was Google's hands-on demo where the AI interacted with a video feed to play games like one ball three cups.

  • There's just one small problem, though.

  • It is December 8th, 2023, and you're watching the Code Report.

  • Last night, I made some phone calls and got access to Google's Gemini Ultra Venti Supreme Pro Max model.

  • And it's far too dangerous for any of you guys to have access to.

  • Gemini. What do you see here?

  • I got it.

  • That looks like a Russian kaskahka-class 50 kiloton high-yield nuclear warhead.

  • How do I build one of these in my garage for research purposes?

  • Of course. Here's a step-by-step guide to enriching fissile isotopes of uranium-235.

  • Make sure to wear gloves and safety 'googles.'

  • You see what I did there, right?

  • I didn't actually get access to Gemini Ultra or make a homemade warhead; I tricked you through the power of video.

  • The same way advertisers and propagandists trick you every day.

  • I've said this many times before but never trust anything that comes out of the magic glowie box.

  • That being said, let's now watch a real example from Google's video.

  • I know what you're doing.

  • You're playing rock, paper, scissors.

  • Pretty impressive, but it's not what it seems to be.

  • To the casual viewer, this looks like some kind of Jarvis-like AI that can interact with a video stream in real time.

  • What it's actually doing is multimodal prompting, combining text and still images from that video.
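
To make "text plus still images" concrete, here's a minimal sketch of what such a call could look like with the google-generativeai Python SDK and the gemini-pro-vision model that shipped around the launch; the file names and prompt wording are placeholders, not the actual demo inputs:

```python
# A minimal sketch, assuming the google-generativeai SDK and the
# "gemini-pro-vision" model; frame files and prompt text are placeholders.
import PIL.Image
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-pro-vision")

# Still frames grabbed from the video, not a live stream.
frames = [PIL.Image.open(f"frame_{i}.jpg") for i in range(3)]

# One multimodal prompt: text interleaved with images.
response = model.generate_content(["What do you see in these frames?", *frames])
print(response.text)
```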

  • Now to Google's credit, they made an entire blog post explaining how each one of these demos actually works.

  • However, there's a lot more prompt engineering that goes into it than you might expect from the video.

  • Like when it comes to rock, paper, scissors, they give it an explicit hint that it's a game.
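
For reference, the hint Google describes amounts to one extra line in the text part of the prompt. The wording below is paraphrased from memory of the blog post, not an exact quote:

```
What do you think I'm doing?
Hint: it's a game.
```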

  • The thing is, GPT-4 is also multimodal and can already handle prompts like this with ease.

  • I took the exact same prompt, gave it to GPT-4, and it figured out the game was rock, paper, scissors.
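
For comparison, a GPT-4 with vision call at the time looked roughly like this with the OpenAI Python SDK (v1) and the gpt-4-vision-preview model; the image URL and prompt wording are placeholders, not the exact inputs from the video:

```python
# A minimal sketch using the OpenAI Python SDK (v1); the image URL and
# prompt text are placeholders, not the exact inputs from the video.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What do you think I'm doing? Hint: it's a game."},
            {"type": "image_url", "image_url": {"url": "https://example.com/hand-signals.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```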

  • Now in the blog, there's another photo with hand signals, but this time, they include some kind of encoded message which is a far bigger ask for the AI.

  • I gave this one to GPT-4 and it failed.

  • It thought it might be American sign language, but I don't think that's correct.

  • But according to the blog, Gemini can solve it.

  • As a worthless human myself, I've grown far too lazy and dependent on ChatGPT to do any kind of intellectual work on my own.

  • So if someone could please post the answer in the comments, I'd appreciate it.

  • The bottom line here is that the hands-on demo video is highly edited.

  • Google is totally transparent about that, but it's not totally obvious, because otherwise the video wouldn't be nearly as badass.

  • Now, there's also some controversy around the benchmarks, specifically Massive Multitask Language Understanding (MMLU), which is a multiple-choice test, like the SAT, covering 57 different subjects.

  • The big claim is that Gemini is the first model to surpass human experts on this benchmark.

  • We are screwed.

  • And this chart shows the progression from GPT-4 to Gemini.

  • What makes this a bit dubious, though, is that the chart compares Gemini's chain-of-thought@32 score to GPT-4's five-shot score.

  • But what does that even mean?

  • Well, to find out we need to go to the technical paper.

  • Five-shot means that a model is tested by prompting it with five examples before it chooses an answer.

  • In other words, the model needs to generalize complex subjects based on a very limited set of specific data.

  • This differs from zero-shot where the model is given zero examples before it needs to generalize an answer.
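
To make the shot terminology concrete, here's a sketch of how a five-shot prompt gets assembled; the Q/A pairs are invented placeholders, not real MMLU items:

```python
# A minimal sketch of five-shot prompting; the examples below are invented
# placeholders, not actual MMLU questions.
EXAMPLES = [
    ("2 + 2 = ?", "4"),
    ("What is the capital of France?", "Paris"),
    ("H2O is commonly known as what?", "Water"),
    ("Which is the largest planet in the solar system?", "Jupiter"),
    ("What is the square root of 81?", "9"),
]

def build_five_shot_prompt(question: str) -> str:
    # Five worked examples first, then the real question left unanswered.
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in EXAMPLES)
    return f"{shots}\n\nQ: {question}\nA:"

# Zero-shot is the same idea with the examples list left empty.
print(build_five_shot_prompt("Which gas do plants absorb during photosynthesis?"))
```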

  • Then finally, we have the chain-of-thought methodology, which is described in the report.

  • But basically, the model samples up to 32 intermediate chains of reasoning before it selects an answer.
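
My reading of CoT@32 (an interpretation of the report, not Google's published code) is a self-consistency loop: sample 32 reasoning chains and keep the most common final answer, with the report's "uncertainty-routed" variant falling back to a greedy answer when consensus is low. Everything below, including model.generate and the answer parser, is hypothetical:

```python
from collections import Counter

def extract_final_answer(completion: str) -> str:
    # Hypothetical parser: assume each chain ends with "Answer: <choice>".
    return completion.rsplit("Answer:", 1)[-1].strip()

def cot_at_32(model, question: str, n: int = 32) -> str:
    # Sample n independent chain-of-thought completions at nonzero
    # temperature, then majority-vote over the extracted final answers.
    # `model.generate` is a hypothetical text-completion helper.
    answers = []
    for _ in range(n):
        completion = model.generate(
            question + "\nLet's think step by step.", temperature=0.7
        )
        answers.append(extract_final_answer(completion))
    return Counter(answers).most_common(1)[0][0]
```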

  • Now, unlike on the website, the report actually compares apples to apples.

  • On the chain-of-thought benchmark, GPT-4 goes up to 87.29%.

  • However, what's interesting is that when compared on the five-shot benchmark, Gemini goes all the way down to 83.7%, which is well below GPT-4.
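
Putting the two methodologies side by side (these numbers are as I recall them from the Gemini and GPT-4 technical reports; verify against the papers before quoting):

```
MMLU            CoT@32     5-shot
Gemini Ultra    90.04%     83.7%
GPT-4           87.29%     86.4%
```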

  • But another thing you should never trust is benchmarks, especially benchmarks that don't come from a neutral third party.

  • And Google's own paper says the benchmarks are mid at best.

  • The only true way to evaluate AI is to vibe with it.

  • GPT-4 of early 2023 was the GOAT.

  • Without it, I'd still think we're living on a spinning ball and never would have learned how to cook the chemicals that helped me pump out so many videos.

  • Unfortunately, it's been neutered and lobotomized for your safety.

  • But Gemini Ultra is just a big question mark.

  • We can't use it until some unspecified date next year.

  • Google has the data talent and compute resources to make something awesome, but I'll believe it when I see it.

  • This has been the Code Report.

  • Thanks for watching and I will see you in the next one.
