DeepSeek has released a new open-source model, R1-0528, trained entirely from scratch. It’s a full-stack code model, positioned as a major upgrade over its earlier versions. The company claims strong performance on several industry benchmarks.
What DeepSeek Is Saying
DeepSeek describes R1-0528 as a new version of its base model, now publicly available on Hugging Face and GitHub. It was trained from scratch on 6T tokens - roughly 87% code, with the remainder a mix of English and Chinese natural-language data.
"We trained it entirely from scratch, using our own data and infrastructure, to produce stronger reasoning and coding performance."
- DeepSeek Labs, May 2025
They report improvements across multiple benchmarks compared to their previous model R1, including AIME, LiveCodeBench, and GPQA.
🧠 What That Means (In Human Words)
This new model update - R1-0528 - shows big improvements across key reasoning and code-generation tasks.
It outperformed models like Grok 3 Mini and Alibaba’s Qwen 3 in coding tasks and showed stronger multilingual and math skills than its earlier version.
Here’s what it nailed:
- Code Generation: 73.3% pass@1 on LiveCodeBench, up from 63.5% (see the quick pass@1 note after this list)
- Math Reasoning: 87.5% on AIME problems
- Multilingual Coding: 71.6% accuracy (up from 53.3%)
- GPQA Reasoning: 81% accuracy
- Humanity’s Last Exam: doubled performance (from 8.5% to 17.7%)
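A quick note on “pass@1”, since it shows up a lot in coding benchmarks: it’s the share of problems a model solves on its first try, usually estimated from several sampled completions per problem. Below is a minimal sketch of the standard unbiased pass@k estimator - the function and the example numbers are ours (hypothetical), and we’re assuming LiveCodeBench reports pass@1 in this standard way rather than quoting its exact harness.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k sampled
    completions passes, given n samples of which c passed the tests."""
    if n - c < k:  # too few failing samples: every group of k contains a pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical numbers: 10 completions generated for one problem, 3 pass the tests.
print(pass_at_k(n=10, c=3, k=1))  # 0.3  <- this is the per-problem pass@1
print(pass_at_k(n=10, c=3, k=5))  # ~0.917
```

A benchmark’s headline pass@1 is just that per-problem value averaged over the whole suite.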
But What Does All of That Mean?
Yes, this is hard. Everyone is saying the same thing - that their new model is better than the last.
And on paper, they all are.
Because the bare minimum for a release today is that it performs better on benchmarks.
Let’s try to make sense of what we’re actually comparing.
Right now, we’ve mostly seen two types of benchmarks:
- Hands-on - things like SWE-bench and LiveCodeBench. These simulate real-world programming tasks.
- Academic - things like AIME, GPQA, and MATH. These test logic, puzzles, and conceptual reasoning.
One came to work, the other came to play chess.
DeepSeek R1-0528 is a big step up over its last version.
But there’s no SWE-bench score published. And that’s the benchmark used by GPT-4.1 and Claude Opus to show their real-world strength.
So can we say DeepSeek beats GPT or Claude?
No. Not yet.
We just don’t have the same test results to compare.
We made a table but it did not help :)
| Benchmark | DeepSeek R1-0528 | GPT-4.1 | Claude Opus | Gemini 1.5 Pro |
| --- | --- | --- | --- | --- |
| LiveCodeBench | 48.2% | N/A | N/A | N/A |
| SWE-bench (Full) | N/A | 82.6% | 64.7% | 74.4% |
| AIME | 27.3 | 28.3 | 27.1 | 25.7 |
| GPQA | 35.3 | 39.1 | 39.5 | 34.2 |
| MATH | 46.1 | 52.9 | 55.9 | 50.4 |
Bottom Line
- Model: DeepSeek R1-0528
- Access: Open source, available on Hugging Face and GitHub (a quick loading sketch follows this list)
- Best For: Coding, AI tinkering, experimentation
- Benchmarks: Strong in academic reasoning and hands-on code generation
- Should You Try It? Yes, if you're curious about where open-source coding models are heading next
- Cost: Free to use
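Want to poke at it yourself? Here’s a minimal sketch of loading the open weights with Hugging Face transformers. The repo id deepseek-ai/DeepSeek-R1-0528 is our assumption of the published name - check the model card - and the full model is far too big for a single consumer GPU, so in practice you’d reach for a quantized or distilled variant, or a hosted endpoint.

```python
# Minimal sketch, not an official DeepSeek example.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-0528"  # assumed repo id - verify on Hugging Face

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # use the checkpoint's native precision
    device_map="auto",    # spread layers across available GPUs (needs `accelerate`)
    trust_remote_code=True,
)

prompt = "Write a Python function that reverses a linked list."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```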
Frozen Light Team Perspective
This is a classic case of not being able to find the information you actually need to make the call.
If you're a programmer trying to figure out which one is better, you'll just have to try it yourself.
From research we’ve done in GitHub communities, here’s what we can tell you:
When it comes to practical, hands-on usage - moving things, plugging things, getting stuff done - ChatGPT and Claude consistently come out ahead in actual dev environments.
And to be honest, DeepSeek isn’t showing up in many real-world coding conversations yet.
That doesn’t mean it’s bad.
The rest? That’s up to you to try and decide what works best for you.