
  • Chances are you've heard about the newest entrant to the very crowded and very competitive realm of AI models, DeepSeek.

  • It's a startup based in China, and it caught everyone's attention by taking over OpenAI's coveted spot as the most-downloaded free app in the US on Apple's App Store.

  • So how?

  • Well, by releasing an open source model that it claims can match or surpass the performance of other industry-leading models, and at a fraction of the cost.

  • Now, the specific model that's really making a splash from DeepSeek is called DeepSeek R1.

  • And the R here implies reasoning, because this is a reasoning model.

  • DeepSeek R1 is their reasoning model.

  • Now, DeepSeek R1 performs as well as some of the other models, including OpenAI's own reasoning model.

  • That's called o1, and DeepSeek R1 can match or even outperform it across a number of AI benchmarks for math and coding tasks.

  • Which is all the more remarkable because, according to DeepSeek, DeepSeek R1 is trained with far fewer chips and is approximately 96% cheaper to run than o1.

  • Now, unlike previous AI models, which produced an answer without explaining the why, a reasoning model solves complex problems by breaking them down into steps.

  • So before answering a user query, the model spends time thinking.

  • "Thinking" in air quotes here.

  • And that thinking time could be a few seconds or even minutes.

  • Now, during this time, the model is performing step-by-step analysis through a process that is known as chain of thought.

  • And unlike other reasoning models, R1 shows the user that chain-of-thought process as it breaks the problem down, as it generates insights, as it backtracks as it needs to, and as it ultimately arrives at an answer.
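
As a concrete aside: R1 exposes that visible chain of thought in its raw output, typically wrapping the reasoning in `<think>` tags before the final answer. Here's a minimal sketch of separating the two halves, assuming that tag convention (the sample response string is made up for illustration):

```python
import re

def split_reasoning(response: str) -> tuple[str, str]:
    """Separate the visible chain of thought from the final answer.

    Assumes the model wraps its reasoning in <think>...</think> tags
    before emitting the answer, as DeepSeek R1 does.
    """
    match = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    thinking = match.group(1).strip() if match else ""
    answer = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()
    return thinking, answer

# Hypothetical raw model output:
raw = "<think>2 apples + 3 apples... that's 5.</think>The answer is 5."
thoughts, answer = split_reasoning(raw)
print(thoughts)  # → 2 apples + 3 apples... that's 5.
print(answer)    # → The answer is 5.
```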

  • Now, I'm going to get into how this model works, but before that, let's talk about how it came to be.

  • DeepSeek R1 seems to have come out of nowhere, but there are in fact many DeepSeek models that brought us to this point.

  • A model avalanche, if you like.

  • And my colleague Aaron can help dig us out.

  • Thanks, Martin. There is certainly a lot to dig out here.

  • There are a lot of these models, but let's start from the very top and beginning of all this.

  • So we begin, and we go to, let's say, DeepSeek version 1, which is a 67-billion-parameter model that was released in January of 2024.

  • Now, this is a traditional transformer with a focus on the feed-forward neural networks.

  • This gets us down into DeepSeek version 2, which really put this on the map.

  • This is a very large 236-billion-parameter model that was released not that long after the original, in June 2024.

  • But to put this into perspective, there are really two novel aspects around this model.

  • The first one was the multi-head latent attention.

  • And the second aspect was the DeepSeek mixture of experts.

  • It just made the model really fast and performant.

  • And it set us up for success for DeepSeek version 3, which was released in December of 2024.

  • Now, this one is even bigger.

  • It's 671 billion parameters.

  • But this is where we began to see the introduction of reinforcement learning with that model.

  • And some other contributions that this model had: it was able to balance load across many GPUs, because they used a lot of H800s within their infrastructure.

  • And that was also built on top of DeepSeek V2.

  • So all these models accumulate and build on top of each other, which gets us down into DeepSeek R1-Zero, which was released in January of 2025.

  • So this is the first of the reasoning models now, right?

  • It is.

  • Yeah.

  • And it's really neat how they began to train these types of models.

  • So it's a type of fine-tuning.

  • But on this one, they exclusively used reinforcement learning, which is a way where you have policies and you want to reward or penalize the model for some action it has taken or output it has produced.

  • And it self-learns over time.

  • And it was very performant.

  • It did well.

  • But it got even better with DeepSeek R1, which was, again, built on top of R1-Zero.

  • And this one used a combination of reinforcement learning and supervised fine-tuning, the best of both worlds, so that it could be even better.

  • And it's very close in performance on many standards and benchmarks to some of these OpenAI models we have now.

  • And this gets us down into distilled models now, which is like a whole other paradigm.

  • Distilled models.

  • Okay.

  • So tell me what that is all about.

  • Yeah, great question and comment.

  • So first of all, a distilled model is where you have a student model, which is a very small model, and you have the teacher model, which is very big.

  • And you want to distill or extract knowledge from the teacher model down into the student model.

  • In some aspects, you could think of it as model compression.

  • But one interesting aspect around this is that it's not just compression or transferring knowledge; it's model translation, because we're going from R1-Zero, which is one of those mixture-of-experts models, down into, for example, a Llama-series model, which is not a mixture of experts but a traditional transformer.

  • So you're going from one architecture type to another, and we do the same with Qwen.

  • Right?

  • So there are different series of models that are the foundation that we then distill into from R1-Zero.
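
The student-teacher transfer Aaron describes is usually driven by a distillation loss that pushes the student's output distribution toward the teacher's. Here's a minimal sketch of that objective; the logit values and temperature are purely illustrative, not taken from any real model:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of raw logits."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence nudging the student's distribution toward the teacher's.

    A higher temperature softens both distributions, so the student also
    learns the teacher's relative preferences among wrong answers, not
    just its top pick.
    """
    p = softmax(teacher_logits, temperature)  # teacher = soft targets
    q = softmax(student_logits, temperature)  # student = predictions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# A student that roughly agrees with the teacher incurs a smaller loss
# than one that disagrees (illustrative numbers):
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]
far_student = [0.5, 4.0, 1.0]
print(distillation_loss(close_student, teacher) < distillation_loss(far_student, teacher))  # → True
```

Minimizing this loss over many training examples is what "extracts" the teacher's knowledge into the smaller student.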

  • Well, thanks.

  • That's really interesting, to get the history behind all this.

  • It didn't come from nowhere.

  • But with all of these distilled models coming, I think you might need your shovel back to dig your way out of those.

  • Thank you very much.

  • There are going to be a lot of distilled models.

  • So you're exactly right.

  • I think I'm going to go dig.

  • Thanks.

  • So R1-Zero didn't come from nowhere.

  • It's an evolution of other models.

  • But how does DeepSeek operate at such comparatively low cost?

  • Well, by using a fraction of the highly specialized NVIDIA chips used by their American competitors to train their systems.

  • In fact, I can illustrate this in a graph.

  • So if we consider different types of model, and then the number of GPUs that they use.

  • Well, DeepSeek engineers, for example, said that they only needed 2,000 GPUs, that's graphics processing units, to train the DeepSeek V3 model.

  • DeepSeek V3.

  • Now, in isolation, what does that mean?

  • Is that good?

  • Is that bad?

  • Well, by contrast, Meta said that the company was training their latest open source model.

  • That's Llama 4.

  • And they are using a computer cluster with over 100,000 NVIDIA GPUs.

  • So that brings up the question of how is it so efficient?

  • Well, DeepSeek R1 combines chain-of-thought reasoning with a process called reinforcement learning.

  • This is a capability that Aaron mentioned just now, which arrived with the V3 model of DeepSeek.

  • And here, an autonomous agent learns to perform a task through trial and error, without any instructions from a human user.

  • Now, traditionally, models will improve their ability to reason by being trained on labeled examples of correct or incorrect behavior.

  • That's known as supervised learning.

  • Or by extracting information from hidden patterns.

  • That's known as unsupervised learning.

  • But the key hypothesis here with reinforcement learning is to reward the model for correctness, no matter how it arrived at the right answer, and to let the model discover the best way to think all on its own.
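
To make that hypothesis concrete, here is a toy, hypothetical sketch of outcome-only reward: the "model" chooses between two made-up reasoning strategies, the only feedback is whether its final answer was correct, and it is never told how to think. This is a bare-bones bandit loop, not DeepSeek's actual training algorithm:

```python
import random

random.seed(42)

# Two stand-in reasoning strategies with assumed success rates.
STRATEGIES = ["quick_guess", "step_by_step"]
TRUE_ACCURACY = {"quick_guess": 0.3, "step_by_step": 0.9}  # illustrative only

def answer_is_correct(strategy: str) -> bool:
    """Simulate attempting a problem with the given strategy."""
    return random.random() < TRUE_ACCURACY[strategy]

# Running estimate of each strategy's expected reward.
value = {s: 0.0 for s in STRATEGIES}
counts = {s: 0 for s in STRATEGIES}

for _ in range(2000):
    strategy = random.choice(STRATEGIES)  # explore both strategies
    reward = 1.0 if answer_is_correct(strategy) else 0.0  # correctness only
    counts[strategy] += 1
    # Incremental average: nudge the estimate toward the observed reward.
    value[strategy] += (reward - value[strategy]) / counts[strategy]

# Rewarded only for right answers, the policy discovers on its own
# that thinking step by step pays off.
best = max(value, key=value.get)
print(best)  # → step_by_step
```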

  • Now, DeepSeek R1 also uses a mixture-of-experts architecture, or MoE.

  • And a mixture-of-experts architecture is considerably less resource intensive to train.

  • Now, the MoE architecture divides an AI model up into separate entities, or subnetworks, which we can think of as being individual experts.

  • So in my little neural network here, I'm going to create three experts.

  • A real MoE architecture would probably have quite a bit more than that.

  • But each one of these is specialized in a subset of the input data.

  • And the model only activates the specific experts needed for a given task.

  • So a request comes in, we activate the experts that we need, and we only use those, rather than activating the entire neural network.

  • So consequently, the MoE architecture reduces computational costs during pre-training and achieves faster performance during inference time.
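
The routing idea above can be sketched in a few lines. In this toy version the three "experts" are stand-in functions and the gate weights are made up; in a real MoE layer, the experts are feed-forward networks and the gate is learned, and routing is often top-k rather than top-1:

```python
# Toy top-1 mixture-of-experts routing (all names and weights hypothetical).
EXPERT_FNS = {
    "expert_math": lambda x: [v * 2.0 for v in x],
    "expert_code": lambda x: [v + 1.0 for v in x],
    "expert_chat": lambda x: [-v for v in x],
}

# One gating weight vector per expert (assumed values).
GATE = {
    "expert_math": [1.0, 0.0],
    "expert_code": [0.0, 1.0],
    "expert_chat": [-1.0, -1.0],
}

def moe_forward(x):
    """Route the input to the single highest-scoring expert (top-1 gating).

    Only the chosen expert runs; the others stay idle, which is where
    the compute savings come from.
    """
    scores = {name: sum(xi * wi for xi, wi in zip(x, w))
              for name, w in GATE.items()}
    chosen = max(scores, key=scores.get)
    return chosen, EXPERT_FNS[chosen](x)

chosen, out = moe_forward([3.0, 0.5])  # a "math-looking" input
print(chosen)  # → expert_math
```

Note that only one of the three expert functions ever executes per request, even though all of their parameters exist in the model.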

  • And look, MoE, that architecture isn't unique to models from DeepSeek.

  • There are models from the French AI company Mistral that also use this.

  • And in fact, the IBM Granite model is also built on a mixture-of-experts architecture.

  • So it's a commonly used architecture.

  • So that is DeepSeek R1.

  • It's an AI reasoning model that matches other industry-leading models on reasoning benchmarks, while being delivered at a fraction of the cost in both training and inference.

  • All of which makes me think that this is an exciting time for AI reasoning models.

  • Thank you.
