This new large-language model has taken the tech world by absolute storm and represents a big breakthrough in the AI research community.
Last Sunday, while TikTok was banned for 12 hours, an AI research team from China released a new large-language model called DeepSeek R1.
As you can see on the screen, DeepSeek R1's benchmarks show that it performs at a similar level to OpenAI's o1 model on reasoning problems like math, coding, and scientific reasoning.
And in this video, I'll talk about the three main takeaways from their paper, including how they use Chain of Thought in order to have the model self-evaluate its performance, how it uses pure reinforcement learning to have the model guide itself, and how they use model distillation to make DeepSeek and other LLMs more accessible to everyone.
Chain of Thought is a very simple but effective prompt engineering technique where we pretty much ask the model to think out loud.
We add to our prompts that we want the model to explain its reasoning step-by-step.
That way, if the model makes any mistakes, we can easily pinpoint where in its reasoning it was off so that we can re-prompt the model to not make the mistake again.
Here is an example from the paper, where if you give the model a question like this math problem, you can see that in its response, it actually reasons through it and gives you the steps to how it got to the solution.
It showed its work.
You can see in red, it says, "Wait, wait, there's an aha moment," as well as, "Let's reevaluate this step-by-step."
In doing so, the model is going to give a more accurate response than if it were to just give the answer by itself without Chain of Thought reasoning.
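To make that concrete, here's a minimal sketch of a Chain of Thought prompt (my own illustration, not the exact prompt from the paper):

```python
# Minimal sketch of a Chain of Thought prompt (illustrative, not the exact
# prompt used in the DeepSeek-R1 paper).
question = (
    "A train travels 120 km in 2 hours and then 180 km in 3 hours. "
    "What is its average speed?"
)

cot_prompt = (
    f"{question}\n\n"
    "Think out loud: explain your reasoning step by step, show each "
    "intermediate calculation, and only then give the final answer."
)

print(cot_prompt)  # pass this string to whichever chat LLM you are using
```

The only change from a normal prompt is that last instruction, which is what gets the model to show its work so you can spot where its reasoning goes wrong.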
The way DeepSeek uses reinforcement learning is a little different from how most AI models are trained. We don't give it the question and the answer; we kind of let it learn on its own.
This is a lot like how a baby learns to walk for the first time.
If you notice, if you've ever seen a baby, it's actually pretty funny.
They stumble around the environment, and they maybe hold on to things as they try to decide how to walk.
In doing so, they're learning how to move and position their joints so that they don't fall.
In the same way, reinforcement learning allows us to train a model by optimizing its policy, aka how the model behaves, and it does so to maximize the reward.
As it explores its environment over time, it learns which policies maximize the reward.
Then it just picks whichever of those policies gives the highest reward.
For example, if you're solving an equation, there might be two or three different ways to solve it, but one of them is much shorter than the others, and thus gets a much higher reward.
Reinforcement learning is exactly how most robots learn how to walk, and how Tesla's self-driving car learns how to drive through a city.
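Here's a toy sketch of that core loop (my own illustration, not the paper's code): an agent tries different strategies, gets a reward, and gradually settles on the strategy with the highest average reward.

```python
import random

# Toy reinforcement-learning loop (illustrative only): three "policies" for
# solving a problem, where the shorter solution earns a higher reward.
rewards = {"long_solution": 0.3, "medium_solution": 0.6, "short_solution": 1.0}

estimates = {policy: 0.0 for policy in rewards}  # running reward estimates
counts = {policy: 0 for policy in rewards}

for step in range(1000):
    # Epsilon-greedy exploration: usually exploit the best-looking policy,
    # but sometimes try a random one, like a baby stumbling around.
    if random.random() < 0.1:
        policy = random.choice(list(rewards))
    else:
        policy = max(estimates, key=estimates.get)

    reward = rewards[policy] + random.gauss(0, 0.1)  # noisy feedback
    counts[policy] += 1
    estimates[policy] += (reward - estimates[policy]) / counts[policy]

print(max(estimates, key=estimates.get))  # converges to "short_solution"
```

Nobody ever tells the agent which solution is the shortest; it just keeps whatever behavior the reward signal favors, which is the same idea DeepSeek applies to answering questions.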
If we go to the paper and look at this graph, we can see how DeepSeek R1 improves how accurately it can answer questions if we train it over time.
Using reinforcement learning, instead of telling the model what the correct answer to a question is, since that kind of data is pretty expensive to obtain, we instead let it figure things out on its own while measuring how accurate the model is.
You can see that while OpenAI's o1 score stays static, DeepSeek R1 eventually outperforms it, and it looks like it would improve even more and get closer to 90 or even 100% accuracy if we kept training it.
You can see how the model uses chain-of-thought reasoning in order to improve its responses over time and self-reflect.
In reinforcement learning, we can't exactly tell the model how to change its policy, so that's why we use chain-of-thought reasoning to force the model to self-reflect and re-evaluate its answers, changing its behavior to get closer to the maximum reward.
That way, we can give the model the right incentives using prompts, and the model can re-evaluate how it answers questions, and it can do so with an increasing accuracy.
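As a rough sketch of what those incentives could look like (hypothetical details, in the spirit of the rule-based rewards described in the paper), a reward might combine an accuracy check on the final answer with a format check on the reasoning:

```python
import re

# Rough sketch of a rule-based reward (illustrative, not the paper's code):
# a reward for getting the final answer right, plus a small reward for
# wrapping the reasoning in the expected <think>...</think> tags.
def reward(model_output: str, reference_answer: str) -> float:
    score = 0.0

    # Format reward: did the model show its reasoning in the expected tags?
    if re.search(r"<think>.*?</think>", model_output, flags=re.DOTALL):
        score += 0.1

    # Accuracy reward: compare what comes after the reasoning to the reference.
    final_part = re.sub(r"<think>.*?</think>", "", model_output, flags=re.DOTALL)
    if reference_answer.strip() in final_part:
        score += 1.0

    return score

print(reward("<think>2 + 2 = 4</think> The answer is 4.", "4"))  # 1.1
```

The point is that the reward only scores the outcome; the model has to discover the reasoning that reliably earns it.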
This equation is the key behind how DeepSeek uses reinforcement learning in order to optimize its policy.
It uses Group Relative Policy Optimization (GRPO), which essentially uses this equation to score how well the model answered a question without needing a labeled correct answer.
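The equation itself isn't reproduced in these subtitles, but reconstructed roughly from the paper (notation may differ slightly), the GRPO objective looks like this:

```latex
% Sketch of the GRPO objective, reconstructed from the DeepSeek paper;
% notation may differ slightly from the original.
\begin{aligned}
J_{\mathrm{GRPO}}(\theta) ={}& \mathbb{E}_{q \sim P(Q),\;\{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(O \mid q)}
\Bigg[ \frac{1}{G} \sum_{i=1}^{G} \min\!\Bigg(
\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}\, A_i,\;
\operatorname{clip}\!\Big(\tfrac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},\, 1-\varepsilon,\, 1+\varepsilon\Big) A_i
\Bigg) - \beta\, \mathbb{D}_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big) \Bigg], \\
A_i ={}& \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}.
\end{aligned}
```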
This looks very, very complicated, and I'll just briefly explain the most important parts of it.
What we do is we take the expectation over a group of answers sampled from the model's old policy.
Remember, the policy, pi, is the key thing we're trying to optimize with DeepSeek: we want to change the policy so that DeepSeek outputs better, more correct answers.
So what we do is we take a weighted average that compares, for each answer, how likely the model's new policy is to give that answer versus how likely its old policy was to give it.
And we also multiply each of those comparisons by a standardized advantage value, A_i.
A_i is basically saying, compared to the group's average reward, how much better or worse does this answer do?
And what we also want to do is we don't want to have the model's policy change too much because that can cause a lot of instability with model training.
If you look at most reinforcement learning charts and graphs, or even the example of a baby, the baby's going to fall down unpredictably so many times.
And what we want to do is we want to make sure our model is as stable as possible and we avoid a roller coaster of policy changes.
That's where this clipping comes in.
Clipping essentially restricts how much the ratio between the new and old policy can change, keeping it between 1 minus epsilon and 1 plus epsilon.
And we also standardize that.
So the weighted average is basically asking: how small a change can we make to our policy while still maximizing the reward?
We also subtract a regularization term called the KL divergence.
This is pretty much another way for us to stabilize model training by making sure the policy doesn't change too much.
And in short, all this is trying to say is that we don't want our policy for our model to change too much, but we want to do so in a way that we can compare our old answers with the new answers.
And then we change our policy so that we ultimately maximize the reward while keeping the policy changes themselves as small as possible.
It's like a min-max kind of situation here.
And that's what it's doing here with the weighted average.
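Putting those pieces together, here's a small numerical sketch (my own illustration, not the paper's code) of the group-relative advantage and the clipped ratio for one question, with the KL term left out:

```python
import math

# Toy numbers for one question with a group of 4 sampled answers.
rewards = [1.0, 0.0, 1.0, 0.5]        # rule-based rewards per answer
new_probs = [0.30, 0.05, 0.40, 0.10]  # pi_theta(o_i | q), new policy
old_probs = [0.25, 0.10, 0.30, 0.15]  # pi_theta_old(o_i | q), old policy
epsilon = 0.2

# Group-relative advantage: standardize each reward against the group.
mean_r = sum(rewards) / len(rewards)
std_r = math.sqrt(sum((r - mean_r) ** 2 for r in rewards) / len(rewards))
advantages = [(r - mean_r) / std_r for r in rewards]

# Clipped surrogate objective, averaged over the group (KL penalty omitted).
terms = []
for a, p_new, p_old in zip(advantages, new_probs, old_probs):
    ratio = p_new / p_old
    clipped = max(1 - epsilon, min(1 + epsilon, ratio))
    terms.append(min(ratio * a, clipped * a))

objective = sum(terms) / len(terms)
print(advantages, objective)
```

Notice that no correct answer appears anywhere: each answer is scored only relative to the other answers in its own group, which is what "group relative" means.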
And so the third important technique that the DeepSeek researchers use with their R1 model is model distillation.
And the idea here is that the full DeepSeek R1 model has 671 billion parameters.
And to run the full model, you pretty much need at least a couple-thousand-dollar GPU, as well as an otherwise pretty expensive computer.
So to make it more accessible, what they do is they take the larger LLM and use it to teach a smaller LLM how it reasons and how it answers questions, so that the smaller LLM can actually perform on the same level as the bigger LLM, but at an order of magnitude smaller parameter size, like 7 billion parameters.
And in the paper, the DeepSeek researchers distilled their DeepSeek model into Llama 3 as well as Qwen.
And the idea here is that the teacher uses, again, chain-of-thought reasoning to generate a lot of examples of it answering questions.
And then those examples are given directly to the student to learn from.
And the student is supposed to answer the questions with a similar accuracy to the larger model.
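Conceptually, the distillation recipe looks something like the sketch below; the function names here are toy placeholders, not an actual API from the paper or any library.

```python
# Conceptual sketch of distillation. The functions are toy stand-ins,
# not the actual pipeline or API from the DeepSeek paper.

def teacher_generate(question: str) -> str:
    # Stand-in for the large teacher model producing a chain-of-thought answer.
    return f"<think>reasoning about: {question}</think> final answer"

def fine_tune(student_name: str, examples: list) -> None:
    # Stand-in for supervised fine-tuning of the small student on the
    # teacher's (question, worked answer) pairs.
    print(f"fine-tuning {student_name} on {len(examples)} teacher examples")

questions = ["What is 17 * 24?", "Solve x^2 - 5x + 6 = 0."]

# 1. The big model generates worked, step-by-step answers.
distillation_data = [(q, teacher_generate(q)) for q in questions]

# 2. The small model is trained to reproduce those answers, inheriting the
#    teacher's reasoning style at a fraction of the parameter count.
fine_tune("7B student model", distillation_data)
```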
And this makes the whole LLM ecosystem much more accessible to people who don't have as many resources.
And the key insight is that in this paper, they found that the student model during reinforcement learning training actually outperforms the teacher model just by a little bit.
But it's doing so, again, at a small fraction of the memory and storage required to use it.
And in the experiments from the paper, the researchers actually found that these smaller distilled models from DeepSeek, as I said, outperform larger models like GPT-4o and Claude 3.5 Sonnet on these math, coding, and scientific reasoning tasks, as you can see in the table right here.
And those three things are kind of the key concepts behind how DeepSeek works.
And hopefully you enjoyed this video.
And if you want to, you can go read the paper in the description below, as well as play around with DeepSeek on Ollama yourself.