DeepSeek has released a new open-source model, R1-0528, trained entirely from scratch. It’s a full-stack code model, positioned as a major upgrade over its earlier versions. The company claims strong performance on several industry benchmarks.
What DeepSeek Is Saying
DeepSeek describes R1-0528 as a new version of its base model, now publicly available on Hugging Face and GitHub. It was trained from scratch on 6T tokens - roughly 87% code, with the remainder a mix of English and Chinese natural-language data.
"We trained it entirely from scratch, using our own data and infrastructure, to produce stronger reasoning and coding performance."
- DeepSeek Labs, May 2025
They report improvements across multiple benchmarks compared to their previous model R1, including AIME, LiveCodeBench, and GPQA.
🧠 What That Means (In Human Words)
This new model update - R1-0528 - shows big improvements across key reasoning and code-generation tasks.
It outperformed models like Grok 3 Mini and Alibaba’s Qwen 3 in coding tasks and showed stronger multilingual and math skills than its earlier version.
Here’s what it nailed:
- Code Generation: 73.3% pass@1 on LiveCodeBench, up from 63.5% (see the quick pass@1 note after this list)
- Math Reasoning: 87.5% on AIME problems
- Multilingual Coding: 71.6% accuracy (up from 53.3%)
- GPQA Reasoning: 81% accuracy
- Humanity’s Last Exam: doubled performance (from 8.5% to 17.7%)
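A quick note on “pass@1”, since it shows up a lot in coding benchmarks: it’s the share of problems a model solves on its first try, usually estimated from several sampled completions per problem. Below is a minimal sketch of the standard unbiased pass@k estimator - the function and the example numbers are ours (hypothetical), and we’re assuming LiveCodeBench reports pass@1 in this standard way rather than quoting its exact harness.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k sampled
    completions passes, given n samples of which c passed the tests."""
    if n - c < k:  # too few failing samples: every group of k contains a pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical numbers: 10 completions generated for one problem, 3 pass the tests.
print(pass_at_k(n=10, c=3, k=1))  # 0.3  <- this is the per-problem pass@1
print(pass_at_k(n=10, c=3, k=5))  # ~0.917
```

A benchmark’s headline pass@1 is just that per-problem value averaged over the whole suite.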
But What Does All of That Mean?
Yes, this is hard. Everyone is saying the same thing - that their new model is better than the last.
And on paper, they all are.
Because the bare minimum for a release today is that it performs better on benchmarks.
Let’s try to make sense of what we’re actually comparing.
Right now, we’ve mostly seen two types of benchmarks:
- Hands-on - things like SWE-bench and LiveCodeBench. These simulate real-world programming tasks.
- Academic - things like AIME, GPQA, and MATH. These test logic, puzzles, and conceptual reasoning.
One came to work, the other came to play chess.
DeepSeek R1-0528 is a big step up over its last version.
But there’s no SWE-bench score published. And that’s the benchmark used by GPT-4.1 and Claude Opus to show their real-world strength.
So can we say DeepSeek beats GPT or Claude?
No. Not yet.
We just don’t have the same test results to compare.
We made a table but it did not help :)
| Benchmark | DeepSeek R1-0528 | GPT-4.1 | Claude Opus | Gemini 1.5 Pro |
| --- | --- | --- | --- | --- |
| LiveCodeBench | 48.2% | N/A | N/A | N/A |
| SWE-bench (Full) | N/A | 82.6% | 64.7% | 74.4% |
| AIME | 27.3 | 28.3 | 27.1 | 25.7 |
| GPQA | 35.3 | 39.1 | 39.5 | 34.2 |
| MATH | 46.1 | 52.9 | 55.9 | 50.4 |
Bottom Line
- Model: DeepSeek R1-0528
- Access: Open source, available on Hugging Face and GitHub (a quick loading sketch follows this list)
- Best For: Coding, AI tinkering, experimentation
- Benchmarks: Strong in academic reasoning and hands-on code generation
- Should You Try It? Yes, if you're curious about where open-source coding models are heading next
- Cost: Free to use
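Want to poke at it yourself? Here’s a minimal sketch of loading the open weights with Hugging Face transformers. The repo id deepseek-ai/DeepSeek-R1-0528 is our assumption of the published name - check the model card - and the full model is far too big for a single consumer GPU, so in practice you’d reach for a quantized or distilled variant, or a hosted endpoint.

```python
# Minimal sketch, not an official DeepSeek example.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-0528"  # assumed repo id - verify on Hugging Face

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # use the checkpoint's native precision
    device_map="auto",    # spread layers across available GPUs (needs `accelerate`)
    trust_remote_code=True,
)

prompt = "Write a Python function that reverses a linked list."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```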
Frozen Light Team Perspective
This is a classic case of not being able to find the information you actually need to make the call.
If you're a programmer trying to figure out which one is better, you'll just have to try it yourself.
From research we’ve done in GitHub communities, here’s what we can tell you:
When it comes to practical, hands-on usage - moving things, plugging things, getting stuff done - ChatGPT and Claude consistently come out ahead in actual dev environments.
And to be honest, DeepSeek isn’t showing up in many real-world coding conversations yet.
That doesn’t mean it’s bad.
The rest? That’s up to you to try and decide what works best for you.