
  • This new large language model has taken the tech world by absolute storm and represents a big breakthrough in the AI research community.

  • Last Sunday, while TikTok was banned for 12 hours, an AI research team from China released a new large language model called DeepSeek R1.

  • As you can see on the screen, DeepSeek R1's benchmarks show that it performs at a similar level to OpenAI's o1 model on reasoning problems like math, coding, and scientific reasoning.

  • And in this video, I'll talk about the three main takeaways from their paper, including how they use Chain of Thought in order to have the model self-evaluate its performance, how it uses pure reinforcement learning to have the model guide itself, and how they use model distillation to make DeepSeek and other LLMs more accessible to everyone.

  • Chain of Thought is a very simple but effective prompt engineering technique where we pretty much ask the model to think out loud.

  • We add to our prompts that we want the model to explain its reasoning step-by-step.

  • That way, if the model makes any mistakes, we can easily pinpoint where in its reasoning it was off so that we can re-prompt the model to not make the mistake again.

  • Here is an example from the paper, where if you give the model a question like this math problem, you can see that in its response, it actually reasons through it and gives you the steps to how it got to the solution.

  • It showed its work.

  • You can see in red, it says, "wait, wait, there's an aha moment," as well as, "let's reevaluate this step-by-step."

  • In doing so, the model is going to have a more accurate response than if it were to just give the answer by itself without Chain of Thought reasoning.
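
    As a concrete illustration, here is a minimal sketch of Chain of Thought prompting against an OpenAI-compatible chat endpoint. The base URL, API key, and model tag below are placeholders of my own, not values from the video.

        # Minimal Chain of Thought prompting sketch (placeholder endpoint and model tag).
        from openai import OpenAI

        client = OpenAI(base_url="http://localhost:11434/v1", api_key="placeholder")

        question = "A train travels 120 km in 1.5 hours. What is its average speed in km/h?"

        # The Chain of Thought instruction: ask the model to think out loud, step by step.
        cot_prompt = (
            f"{question}\n\n"
            "Please reason through this step-by-step, showing each intermediate step, "
            "and only then state your final answer."
        )

        response = client.chat.completions.create(
            model="deepseek-r1:7b",  # placeholder model tag
            messages=[{"role": "user", "content": cot_prompt}],
        )

        # The reply contains the worked steps, so mistakes are easy to pinpoint and re-prompt.
        print(response.choices[0].message.content)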

  • The way DeepSeek uses reinforcement learning is a little different from how most AI models are trained: we don't give it the question and the answer; we kind of let it learn on its own.

  • It's much the same way a baby learns how to walk for the first time.

  • If you've ever watched a baby do this, it's actually pretty funny.

  • They stumble around the environment, and they maybe hold on to things as they try to decide how to walk.

  • In doing so, they're learning how to move and position their joints so that they don't fall.

  • In the same way, reinforcement learning allows us to train a model by optimizing its policy, aka how the model behaves, and it does so to maximize the reward.

  • As it explores its environment over time, it learns which policies maximize the reward.

  • Then it simply picks whichever of those policies gives the higher reward.

  • For example, if you're solving an equation, there might be two or three different ways to solve it, but one of them is much shorter than the others and thus earns a much higher reward.
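
    To make that concrete, here is a toy sketch (my own illustration, not the paper's setup) where candidate solution paths for the same equation are scored with a simple reward that favors shorter correct solutions, and the highest-reward path wins:

        # Toy reward example: correct solutions earn 1.0, with a bonus for fewer steps.
        candidates = {
            "expand_then_factor": ["expand both sides", "collect terms", "factor", "solve for x"],
            "substitute_directly": ["substitute y = x - 2", "solve the linear equation"],
        }

        def reward(steps, correct=True):
            """Correctness matters most; shorter solutions get a small extra reward."""
            return (1.0 + 1.0 / len(steps)) if correct else 0.0

        scores = {name: reward(steps) for name, steps in candidates.items()}
        best = max(scores, key=scores.get)

        print(scores)                 # {'expand_then_factor': 1.25, 'substitute_directly': 1.5}
        print("best policy:", best)   # the shorter solution earns the higher reward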

  • Reinforcement learning is exactly how most robots learn how to walk, and how Tesla's self-driving car learns how to drive through a city.

  • If we go to the paper and look at this graph, we can see how DeepSeek R1 improves how accurately it can answer questions as we train it over time.

  • With reinforcement learning, instead of telling the model what the correct answer to a question is, since that kind of labeled data is pretty expensive to obtain, we let it figure things out on its own while measuring how accurate it is.

  • You can see that while OpenAI's o1 model is static, DeepSeek R1 eventually outperforms it, and if we let it train for even longer, it looks like it would keep improving and get closer to 90 or even 100% accuracy.

  • You can see how the model uses chain-of-thought reasoning in order to improve its responses over time and self-reflect.

  • In reinforcement learning, we can't exactly tell the model how to change its policy, so we use chain-of-thought reasoning to force the model to self-reflect and evaluate its answers, changing its behavior to get closer to the maximum reward.

  • That way, we can give the model the right incentives using prompts, and the model can re-evaluate how it answers questions with increasing accuracy.

  • This equation is the key behind how DeepSeek uses reinforcement learning in order to optimize its policy.

  • It uses Group Relative Policy Optimization (GRPO), which essentially uses this equation to score how well the model answered a question without having the correct answer.

  • This looks very, very complicated, and I'll just briefly explain the most important parts of it.
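
    For reference, the objective being walked through below is Group Relative Policy Optimization as written in the DeepSeek papers; this LaTeX reconstruction is a sketch of the equation shown on screen, not a verbatim copy:

        \mathcal{J}_{\mathrm{GRPO}}(\theta) =
          \mathbb{E}_{\,q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(O \mid q)}
          \left[ \frac{1}{G} \sum_{i=1}^{G} \left(
            \min\!\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}\, A_i,\;
            \operatorname{clip}\!\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},\, 1-\varepsilon,\, 1+\varepsilon \right) A_i \right)
            - \beta\, \mathbb{D}_{\mathrm{KL}}\big( \pi_\theta \,\|\, \pi_{\mathrm{ref}} \big)
          \right) \right],
        \qquad
        A_i = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})}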

  • What we do is take the expectation over a group of answers sampled from the model's old policy.

  • Remember, the policy π is the key thing we're trying to optimize with DeepSeek: we want to change the policy so that DeepSeek outputs better, more correct answers.

  • So what we do is take a weighted average that compares how the model answered questions under its old policy with how it answers them under its new policy.

  • And we also multiply it by a standardization value, A_i.

  • A_i is basically asking, compared to the average reward, how much does this new policy increase the reward?
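
    A minimal sketch of that group-relative advantage, standardizing each sampled answer's reward against its group (illustrative numbers, not from the paper):

        # Group-relative advantage: (reward - group mean) / group standard deviation.
        from statistics import mean, pstdev

        group_rewards = [0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0]  # e.g. 1.0 = answer judged correct

        mu = mean(group_rewards)
        sigma = pstdev(group_rewards) or 1.0  # guard against division by zero if all rewards match

        advantages = [(r - mu) / sigma for r in group_rewards]
        print(advantages)  # positive for above-average answers, negative for below-average ones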

  • We also don't want the model's policy to change too much, because that can cause a lot of instability during training.

  • If you look at most reinforcement learning charts and graphs, or even the example of a baby, the baby's going to fall down unpredictably so many times.

  • What we want is to make sure our model is as stable as possible and to avoid a roller coaster of policy changes.

  • That's where this clipping comes in.

  • Clipping essentially restricts how much our policy can change, to between 1 minus epsilon and 1 plus epsilon.

  • And we also standardize that.

  • So the weighted average is basically asking how small a change we can make to our policy in order to maximize the reward.
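
    Here is a small plain-Python sketch of that clipped update term (an illustration of the idea, not the paper's code); the min keeps the policy from being rewarded for moving too far in a single step:

        # Clipped policy-ratio term: take the more conservative of the raw and clipped updates.
        def clipped_term(ratio, advantage, epsilon=0.2):
            """ratio = new-policy probability / old-policy probability for one sampled answer."""
            clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
            return min(ratio * advantage, clipped_ratio * advantage)

        print(clipped_term(ratio=1.8, advantage=1.0))   # 1.2  -> the large jump gets clipped
        print(clipped_term(ratio=0.9, advantage=-0.5))  # -0.45 -> small changes pass through unchanged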

  • We also subtract a regularization term called the KL divergence.

  • This is pretty much another way for us to stabilize model training, by making sure the policy doesn't change too much.
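
    For illustration, the KL penalty can be estimated per sample like this; the estimator below is the unbiased form reported in the GRPO paper, and the probabilities are made up:

        # Per-sample KL penalty estimate between the current policy and a frozen reference policy.
        import math

        def kl_estimate(p_ref, p_current):
            """Unbiased estimator k = p_ref/p_cur - log(p_ref/p_cur) - 1; always >= 0."""
            ratio = p_ref / p_current
            return ratio - math.log(ratio) - 1.0

        print(kl_estimate(p_ref=0.30, p_current=0.30))  # 0.0   -> no drift, no penalty
        print(kl_estimate(p_ref=0.30, p_current=0.60))  # ~0.19 -> drifting away from the reference is penalized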

  • In short, all of this is saying that we don't want our model's policy to change too much, and we want to update it in a way that lets us compare the old answers with the new answers.

  • And then we change our policy so that, ultimately, we maximize the reward.

  • We maximize the reward while keeping the policy changes as small as possible.

  • It's like a min-max kind of situation here.

  • And that's what it's doing here with the weighted average.

  • And so the third important technique that the DeepSeek researchers use with their R1 model is model distillation.

  • And the idea here is that the actual DeepSeek R1 model has 671 billion parameters.

  • And to run this, you pretty much need at least a couple-thousand-dollar GPU, as well as a pretty expensive computer, to actually run the full model.

  • So to make it more accessible, they take the larger LLM and use it to teach a smaller LLM how it reasons and how it answers questions, so that the smaller LLM can perform at the same level as the bigger LLM, but at an order of magnitude smaller parameter size, like 7 billion parameters.

  • And in the paper, the DeepSeek researchers distilled their DeepSeek model into Llama 3 as well as Qwen.

  • And the idea here is that the teacher again uses chain-of-thought reasoning to generate a lot of examples of it answering questions.

  • And then those examples are given directly to the student as training examples.

  • And the student is supposed to answer the questions with similar accuracy to the larger model.
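
    A rough sketch of that teacher-to-student pipeline (my own illustration with placeholder model tags, endpoint, and file name; the paper itself fine-tunes open models on roughly 800k teacher-generated samples):

        # Distillation data generation: the large teacher writes chain-of-thought answers,
        # which are saved as supervised fine-tuning data for a much smaller student model.
        import json
        from openai import OpenAI

        client = OpenAI(base_url="http://localhost:8000/v1", api_key="placeholder")  # placeholder endpoint

        questions = [
            "Solve for x: 3x + 7 = 22.",
            "What is the derivative of x**3 - 5x?",
        ]

        with open("distillation_data.jsonl", "w") as f:
            for q in questions:
                reply = client.chat.completions.create(
                    model="teacher-r1-671b",  # hypothetical tag for the large teacher model
                    messages=[{"role": "user", "content": q + "\nThink step-by-step, then give the answer."}],
                )
                # Each line becomes one training example for the ~7B student (e.g. a Qwen or Llama variant).
                f.write(json.dumps({"prompt": q, "completion": reply.choices[0].message.content}) + "\n")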

  • And this makes the whole LLM ecosystem much more accessible to people who don't have as many resources.

  • And the key insight from this paper is that they found the student model, during reinforcement learning training, actually outperforms the teacher model by just a little bit.

  • But it does so, again, at a small fraction of the memory and storage required to run it.

  • And in the experiments from the paper, the researchers actually found that these smaller distilled models from DeepSeek, as I said, outperform larger models like GPT-4o and Claude 3.5 Sonnet on these math, coding, and scientific reasoning tasks, as you can see in the table below right here.

  • And those three things are kind of the key concepts behind how DeepSeek works.

  • And hopefully you enjoyed this video.

  • And if you want to, you can go read the paper in the description below, as well as play around with DeepSeek on Ollama yourself.
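
    If you want to try one of the distilled models locally, a minimal sketch with the Ollama Python client might look like this; the model tag is an assumption, so check Ollama's model library for the tags actually available:

        # Chat with a locally pulled distilled DeepSeek-R1 model via the Ollama Python client.
        # Assumes the `ollama` package is installed and a model such as "deepseek-r1:7b" has been
        # pulled beforehand; that tag is an assumption, not something stated in the video.
        import ollama

        response = ollama.chat(
            model="deepseek-r1:7b",
            messages=[{"role": "user", "content": "Explain step-by-step: what is 17 * 24?"}],
        )
        print(response["message"]["content"])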
