
  • I thought it plateaued.

  • I thought the bubble was about to burst and the hype train was derailing.

  • I even thought my software engineering job might be safe from Devin.

  • But I couldn't have been more wrong.

  • Yesterday, OpenAI released a terrifying new state-of-the-art model named O1.

  • And it's not just another basic GPT, it's a new paradigm of deep thinking or reasoning models that obliterate all past benchmarks on math, coding, and PhD level science.

  • And Sam Altman had a message for all AI haters out there.

  • Before we get too hopeful that O1 will unburden us from our programming jobs though, there are many reasons to doubt this new model.

  • It's definitely not ASI, it's not AGI, and not even good enough to be called GPT-5.

  • Following its mission of openness, OpenAI is keeping all the interesting details closed off, but in today's video, we'll try to figure out how O1 actually works and what it means for the future of humanity.

  • It is Friday the 13th, and you're watching The Code Report.

  • GPT-5, Orion, Q-Star, Strawberry.

  • These are all names that leaked out of OpenAI in recent months, but yesterday the world was shocked when they released O1 ahead of schedule.

  • GPT stands for Generative Pre-trained Transformer, and O stands for Oh S*** We're All Gonna Die.

  • First, let's admire these dubious benchmarks.

  • Compared to GPT-4, it achieves massive gains in accuracy, most notably in PhD-level physics and on the Massive Multitask Language Understanding (MMLU) benchmarks for math and formal logic.

  • But the craziest improvements come in its coding ability.

  • At the International Olympiad in Informatics, it was in the 49th percentile when allowed 50 submissions per problem, but then broke the gold medal threshold when it was allowed 10,000 submissions.

  • And compared to GPT-4, its Codeforces Elo went from the 11th percentile all the way up to the 93rd percentile.

  • Impressive, but they've also secretly been working with Cognition Labs, the company that wants to replace programmers with this greasy pirate gigolo named Devin.

  • When using the GPT-4 brain, it only solved 25% of problems, but with O1, the chart went up to 75%.

  • That's crazy, and our only hope is that these internal closed source benchmarks from a VC-funded company desperate to raise more money are actually just BS.

  • Only time will tell, but O1 is no doubt a huge leap forward in the AI race.

  • And the timing is perfect, because many people have been switching from ChatGPT to Claude, and OpenAI is in talks to raise more money at a $150 billion valuation.

  • But how does a deep thinking model actually work?

  • Well technically, they released three new models, O1 Mini, O1 Preview, and O1 Regular.

  • Us plebs only have access to Mini and Preview, and O1 Regular is still locked in a cage, although they have hinted at a $2,000 premium plus plan to access it.

  • What makes these models special though is that they rely on reinforcement learning to perform complex reasoning.

  • That means when presented with a problem, they produce a chain of thought before presenting the answer to the user.

  • In other words, they think.

  • Descartes said, I think, therefore I am, but O1 is still not a sentient life form.

  • Just like a human though, it will go through a series of thoughts before reaching a final conclusion, and in the process produce what are called reasoning tokens.

  • These are like outputs that help the model refine its steps and backtrack when necessary, which allows it to produce complex solutions with fewer hallucinations.

  • But the tradeoff is that the response requires more time, computing power, and money.
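As a toy illustration of the loop described above (not OpenAI's actual implementation — the real chain of thought is hidden), here's a sketch where a stand-in `model` function emits intermediate "reasoning" text that gets fed back into the context until a final answer appears. The `model` function and its canned replies are entirely hypothetical:

```python
def model(prompt: str) -> str:
    # Canned stand-in for an LLM call: each reply depends on what
    # reasoning is already in the context.
    if "step 2" in prompt:
        return "DONE: 42"
    if "step 1" in prompt:
        return "step 2: refine the draft"
    return "step 1: draft an answer"

def solve(question: str, max_rounds: int = 5):
    context = question
    reasoning_tokens = 0
    for _ in range(max_rounds):
        thought = model(context)
        if thought.startswith("DONE"):
            return thought.removeprefix("DONE: "), reasoning_tokens
        reasoning_tokens += len(thought.split())  # billed, but never shown
        context += "\n" + thought  # feed the chain of thought back in
    return None, reasoning_tokens

answer, hidden = solve("What is 6 * 7?")
print(answer, hidden)  # → 42 10
```

The point of the sketch is the tradeoff the transcript mentions: every hidden "thought" costs tokens (and therefore time and money) even though the user only sees the final answer.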

  • OpenAI released a bunch of examples, like this guy making a playable snake game in a single shot, or this guy creating a nonogram puzzle.

  • And the model can even reliably tell you how many R's are in the word strawberry, a question that has baffled LLMs in the past.

  • Actually, just kidding, it failed that test when I tried to run it myself.
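For the record, the letter count itself is trivial in ordinary code; LLMs stumble on it because they process tokens rather than individual characters:

```python
# Counting the R's in "strawberry" — trivial for code, famously hard for LLMs.
word = "strawberry"
print(word.count("r"))  # → 3
```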

  • And the actual chain of thought is hidden from the end user, even though you do have to pay for those tokens at a price of $60 per 1 million.

  • However, they do provide some examples of chain of thought, like in this coding example that transposes a matrix in Bash.

  • You'll notice that it first looks at the shape of the inputs and outputs, then considers the constraints of the programming language, and goes through a bunch of other steps before regurgitating a response.
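In the transcript's example the model writes the transpose in Bash; as a minimal sketch of the task itself, here it is in Python using the standard `zip(*rows)` idiom:

```python
def transpose(matrix):
    # zip(*matrix) groups the i-th element of every row, i.e. the columns.
    return [list(col) for col in zip(*matrix)]

print(transpose([[1, 2, 3], [4, 5, 6]]))  # → [[1, 4], [2, 5], [3, 6]]
```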

  • But this is actually not a novel concept.

  • Google has been dominating math and coding competitions with AlphaProof and AlphaCode for the last few years, using reinforcement learning and synthetic data.

  • But this is the first time a model like this has become generally available to the public.

  • Let's go ahead and find out if it slaps.

  • I remember years ago when I first learned to code, I recreated the classic MS-DOS game Drug Wars, a turn-based strategy game where you play the role of a traveling salesman and have random encounters with Officer Hardass.

  • As a biological human, it took me like a hundred hours to build.

  • But let's first see how GPT-4o does with it.

  • When I asked it to build this game in C with a GUI, it produced code that almost worked, but I wasn't able to get it to compile. After a couple of follow-up prompts, I finally got something working, but the game logic was very limited.

  • Now let's give the new O1 that exact same prompt.

  • What you'll notice is that it goes through the chain of thought, like it's thinking, then assessing compliance, and so on, but what it's actually doing under the hood is creating those reasoning tokens, which should lead to a more comprehensive and accurate result.

  • In contrast to GPT-4o, O1 compiled right away, and it followed the game requirements to a T.

  • At first glance, it actually seemed like a flawless game, but it turns out the app was actually pretty buggy.

  • I kept getting into this infinite loop with Officer Hardass, and the UI was also terrible.

  • I tried to fix these issues with additional follow-up prompts, but they actually led to more hallucinations and more bugs, and it's pretty clear that this model isn't truly intelligent.

  • That being said though, there's a huge amount of potential with this chain of thought approach, and by potential, I mean potential to overstate its capabilities.

  • In 2019, they were telling us GPT-2 was too dangerous to release.

  • Now five years later, you've got Sam Altman begging the feds to regulate his strawberry.

  • It's scary stuff, but until proven otherwise, O1 is just another benign AI tool.

  • It's basically just like GPT-4, with the ability to recursively prompt itself.

  • It's not fundamentally game-changing, but you really shouldn't listen to me.

  • I'm just like a horse influencer in 1910 telling horses a car won't take your job, but another horse driving a car will.

  • This has been The Code Report.

  • Thanks for watching, and I will see you in the next one.
