DeepSeek V4: When NVIDIA's Nightmares Become Reality (And That Is Not Necessarily a Bad Thing)

I will admit it: this one caught me off guard.
I started using DeepSeek V4 a few days ago, partly out of curiosity and partly because API costs from the usual suspects were starting to look like a line item worth worrying about. After a few days of testing it on real tasks - code analysis, long document summaries, content generation, debugging - I can tell you the model works extremely well. Not “good for a Chinese model” or “good for the price.” Just good.
But the most interesting part is not only the quality. It is what is going on under the hood: an architecture built to do a lot with very little, one that has probably already made someone in Santa Clara, California sweat a little.
We are talking about NVIDIA.
DeepSeek V4: this is not an update, it is a redesign
The first thing to understand about V4 is that this is not a simple case of “we added a few layers and bumped the number.” DeepSeek almost completely redesigned the model architecture. And the result comes in two flavors:
- V4 Pro: the big model. 1.6 trillion parameters in total, but only 49 billion active parameters per token. Yes, you read that correctly: trillions.
- V4 Flash: the leaner sibling. 284 billion total parameters, 13 billion active. Lighter, cheaper, and surprisingly capable.
If you are wondering, “What do total parameters versus active parameters actually mean?”, imagine the model as a huge library. Total parameters are all the books on the shelves - all the knowledge accumulated during training. Active parameters are the ones the librarian actually pulls down to answer your question. Opening every book every single time would be painfully slow and absurdly expensive. DeepSeek, instead, opens only a small and intelligent selection, thanks to an architecture called Mixture-of-Experts (MoE).
The result: V4 needs about 27% of the compute that V3 did, and only 10% of the temporary memory (the KV cache, where the model keeps the context it is currently working on). The kind of numbers that can ruin the afternoon of anyone who already ordered a rack full of H100s.
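To make the librarian picture concrete, here is a toy sketch of top-k expert routing in Python. It is not DeepSeek's implementation: the eight "experts", the router scores, and the function names are all made up for illustration. The only numbers taken from the text are the parameter counts at the bottom.

```python
import math

def top_k_route(router_scores, k):
    """Pick the k highest-scoring experts and softmax-normalize their scores."""
    chosen = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)[:k]
    peak = max(router_scores[i] for i in chosen)
    exps = [math.exp(router_scores[i] - peak) for i in chosen]
    total = sum(exps)
    return chosen, [e / total for e in exps]

def moe_forward(x, experts, router_scores, k):
    """Run only the k selected experts on x and blend their outputs."""
    chosen, weights = top_k_route(router_scores, k)
    blended = sum(w * experts[i](x) for i, w in zip(chosen, weights))
    return blended, chosen

def active_fraction(active_billion, total_billion):
    """Share of the parameters that are actually used for one token."""
    return active_billion / total_billion

# Hypothetical toy experts: expert number s just multiplies its input by s.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
scores = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.5, 1.0]  # made-up router scores

output, used = moe_forward(1.0, experts, scores, k=2)
print(used)                                 # [4, 1]: only 2 of 8 experts ran
print(f"{active_fraction(49, 1600):.1%}")   # V4 Pro:   3.1%
print(f"{active_fraction(13, 284):.1%}")    # V4 Flash: 4.6%
```

In other words, with the figures quoted above, each token touches roughly 3% of V4 Pro's parameters and under 5% of V4 Flash's.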
One million tokens of context: finally the standard
Both V4 versions come with a 1 million token context window. It is not a premium feature, not a paid extra: it is the default.
To give you a concrete sense of scale, one million tokens correspond to roughly 750,000 words of English text. The entire Lord of the Rings trilogy, with The Hobbit thrown in, is around 576,000. You can ask it to analyze the whole thing in one shot and still have room left over.
But what really impressed me is not just the size of the window. It is that the quality of the answers holds up even when the context is packed to the brim. With many models, feeding in a massive context is a bit like asking someone to remember an entire movie after three sleepless nights: technically they can do it, but the answers get fuzzy, details slip away, contradictions start creeping in. With V4, precisely because of how CSA and HCA work together (we will get to that in a second), the model keeps a coherent grasp of both the detail and the bigger picture - even when you are analyzing an 80,000-line codebase or a 400-page document. It is not magic. It is well-designed architecture.
For anyone working with large codebases, company documentation, long contracts, or complex reports, this is not just a technical detail. It is a practical turning point.
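If you want a quick back-of-the-envelope check before pasting something huge into the prompt, the rule of thumb above (1 million tokens for about 750,000 words) is enough for a rough estimate. The sketch below assumes that ratio holds for your text, which it will not for code or non-English content, and the `reserved_for_answer` budget is an arbitrary number I picked. For anything serious, count tokens with the model's real tokenizer.

```python
WORDS_PER_TOKEN = 0.75             # rule of thumb: 1M tokens ~ 750,000 words
CONTEXT_WINDOW_TOKENS = 1_000_000  # V4 context window quoted in the text

def estimate_tokens(word_count):
    """Rough token estimate from a word count (an approximation only)."""
    return round(word_count / WORDS_PER_TOKEN)

def fits_in_context(word_count, reserved_for_answer=8_000):
    """True if the text plus an assumed answer budget fits in the window."""
    needed = estimate_tokens(word_count) + reserved_for_answer
    return needed <= CONTEXT_WINDOW_TOKENS

print(estimate_tokens(576_000))   # 768000
print(fits_in_context(576_000))   # True
print(fits_in_context(800_000))   # False
```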
The magical trio: CSA, HCA, and MHC
This is where things get interesting. DeepSeek V4 introduces three architectural innovations that explain how it handles huge contexts without blowing up computational costs.
CSA - Compressed Sparse Attention
Instead of analyzing every single token every time it needs to answer, the model compresses the less relevant information and focuses only on what really matters. It is as if, rather than rereading all 50 chapters of a book just to find the color of the protagonist’s hat, it first created summaries of the different chapters, figured out which one mattered, and only then went back to the pages it actually needed. That means most of the text does not have to be processed in full detail at every step, which saves a substantial amount of compute.
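The "summarize the chapters, pick the relevant one, then reread only those pages" idea can be sketched in a few lines of plain Python. To be clear, this is my own toy version of the analogy, not DeepSeek's actual CSA: the mean-pooled summaries, the `block_size` and `top_blocks` parameters, and the function names are all assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def compressed_sparse_attention(query, keys, values, block_size, top_blocks):
    # 1. "Chapter summaries": one mean key per block of tokens.
    blocks = [(s, min(s + block_size, len(keys)))
              for s in range(0, len(keys), block_size)]
    summaries = [mean_vector(keys[a:b]) for a, b in blocks]
    # 2. Score the summaries against the query and keep the best blocks.
    ranked = sorted(range(len(blocks)),
                    key=lambda i: dot(query, summaries[i]), reverse=True)
    selected = sorted(ranked[:top_blocks])
    # 3. Ordinary softmax attention, but only over tokens in those blocks.
    token_ids = [t for i in selected for t in range(*blocks[i])]
    weights = softmax([dot(query, keys[t]) for t in token_ids])
    dim = len(values[0])
    output = [sum(w * values[t][d] for w, t in zip(weights, token_ids))
              for d in range(dim)]
    return output, token_ids

# Eight toy tokens; only tokens 4 and 5 point in the query's direction.
keys = [[1, 0], [1, 0], [1, 0], [1, 0], [0, 1], [0, 2], [1, 0], [1, 0]]
values = [[float(i)] for i in range(8)]
output, attended = compressed_sparse_attention(
    [0.0, 1.0], keys, values, block_size=2, top_blocks=1)
print(attended)  # [4, 5]: the other six tokens were never scored in detail
```

The saving comes from step 2: the query is compared against four summaries instead of eight tokens, and the detailed pass in step 3 only touches the block that won.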
HCA - Heavily Compressed Attention
While CSA handles the detail, HCA keeps hold of the bigger picture. Imagine a second assistant who compressed the entire Lord of the Rings trilogy into a ten-minute retelling and keeps that summary in mind while searching for the hat detail. That way the answer is accurate in the specifics and coherent with the whole. Global context compression goes up to 128x. (Does this remind you of anything similar?)
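To get a feel for what 128x means in practice, here is a small sketch. The arithmetic uses the numbers from the text; the mean pooling in `heavily_compress` is only a stand-in I chose for illustration, since the real compressor is not described here.

```python
import math

def compressed_length(num_tokens, ratio=128):
    """How many summary entries remain after compressing by `ratio`."""
    return math.ceil(num_tokens / ratio)

def heavily_compress(vectors, ratio):
    """Mean-pool each run of `ratio` vectors into one (assumed stand-in)."""
    summaries = []
    for start in range(0, len(vectors), ratio):
        group = vectors[start:start + ratio]
        summaries.append([sum(col) / len(group) for col in zip(*group)])
    return summaries

print(compressed_length(1_000_000))   # 7813 entries instead of 1,000,000

tokens = [[float(i), 1.0] for i in range(256)]   # 256 toy token vectors
summary = heavily_compress(tokens, 128)
print(len(summary))   # 2
print(summary[0])     # [63.5, 1.0]
```

A full one-million-token context shrinks to fewer than eight thousand entries, which is small enough to keep "in mind" at every step.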
MHC - Manifold-Constrained Hyper-Connections
This technology comes into play during training. In a very deep model, the signal passed from one layer to the next tends to degrade - like the classic game of telephone where the message gets mangled along the way. MHC sends parallel flows and adds a sort of “controller” at every step to make sure the signal stays clean. Without this technique, training a model with 1.6 trillion parameters would be a disaster.
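Here is one way to picture the "controller" in code. Treat it as an assumption-heavy toy, not the real method: I model each parallel flow as a single number, mix the flows with a small matrix at every layer, and constrain that matrix to be doubly stochastic (every row and column sums to 1) using Sinkhorn normalization. The constraint, the matrix values, and the 100 "layers" are all choices I made for illustration.

```python
def sinkhorn(matrix, iterations=50):
    """Alternately rescale rows and columns until both sum to about 1."""
    m = [row[:] for row in matrix]
    for _ in range(iterations):
        m = [[v / sum(row) for v in row] for row in m]
        col_sums = [sum(col) for col in zip(*m)]
        m = [[v / col_sums[j] for j, v in enumerate(row)] for row in m]
    return m

def mix_streams(streams, mixing):
    """New stream i = sum over j of mixing[i][j] * old stream j."""
    return [sum(w * s for w, s in zip(row, streams)) for row in mixing]

raw = [[1.0, 2.0, 0.5],
       [0.3, 1.0, 4.0],
       [2.0, 0.1, 1.0]]        # made-up, unconstrained mixing weights
constrained = sinkhorn(raw)    # same pattern, forced to be doubly stochastic

stable = [1.0, 2.0, 3.0]
unstable = [1.0, 2.0, 3.0]
for _ in range(100):           # pretend the model is 100 layers deep
    stable = mix_streams(stable, constrained)
    unstable = mix_streams(unstable, raw)

print(round(sum(stable), 6))   # 6.0: total signal is preserved
print(sum(unstable) > 1e30)    # True: the unconstrained signal explodes
```

With the constraint in place, the signal neither blows up nor fades as it passes through the layers, which is the telephone-game problem described above.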
Taken together, these three techniques let V4 do things competitors can only do with much more expensive hardware.
Costs: this is where the music really changes
All right, let us get to the point. How much does DeepSeek V4 cost compared with everyone else?
Here is a direct API comparison, expressed as how much less DeepSeek charges per million tokens than each competitor:
| DeepSeek model | Compared with | Input price (no cache) | Output price |
|---|---|---|---|
| V4 Pro | Gemini 3.1 | ~15% lower | ~4x lower |
| V4 Pro | ChatGPT | ~50% lower | ~10x lower |
| V4 Pro | Claude Opus | ~50% lower | ~10x lower |
| V4 Flash | - | almost negligible | almost negligible |
On the input side, the difference is already meaningful. But on the output side - the expensive part, because generating each new token costs far more compute than reading a token of the prompt - the gap turns brutal. Ten times cheaper than ChatGPT and Claude Opus is the kind of thing that changes the math for any developer or company using APIs heavily.
And then there is caching: when you make multiple requests over the same context, for example different analyses of the same document, the cost drops even further because the model does not have to process everything from scratch every time. Like a librarian who already stuck notes on the relevant chapters: from the second question onward, everything moves faster.
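A quick calculation shows why caching matters so much when you keep asking questions about the same document. Every price in this sketch is a hypothetical placeholder, not DeepSeek's actual rate, and I am ignoring the few extra tokens each new question adds. Plug in the real numbers from the pricing page before drawing conclusions.

```python
def total_cost(context_tokens, output_tokens, requests,
               input_price, cached_input_price, output_price, use_cache):
    """Total cost of several requests over the same context.

    All prices are per million tokens. With caching on, only the first
    request pays the full input price for the shared context.
    """
    cost = 0.0
    for i in range(requests):
        cache_hit = use_cache and i > 0
        rate = cached_input_price if cache_hit else input_price
        cost += context_tokens / 1e6 * rate
        cost += output_tokens / 1e6 * output_price
    return cost

# Hypothetical placeholder prices, per million tokens.
prices = dict(input_price=1.00, cached_input_price=0.10, output_price=2.00)

# Ten different analyses of the same 500,000-token document.
no_cache = total_cost(500_000, 2_000, 10, use_cache=False, **prices)
with_cache = total_cost(500_000, 2_000, 10, use_cache=True, **prices)
print(f"{no_cache:.2f}")     # 5.04
print(f"{with_cache:.2f}")   # 0.99
```

With these made-up rates, the cached run costs about a fifth of the uncached one, and the gap widens with every additional question.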
And all of this comes with an MIT open-source license, which means you can download it, modify it, integrate it into commercial products, and do so with very few constraints.
The twist: Huawei Ascend and NVIDIA’s nightmare
And here we get to the part that made headlines across the tech world. Or at least across the corners of the tech world where people still read interesting things.
For the first time in its history, DeepSeek cited Huawei Ascend 950 chips and NVIDIA GPUs in the same list in its official technical report, treating them as equivalent hardware platforms.
V4 was designed from the start to run natively on Ascend 950 chips. Just hours after launch, Huawei confirmed that its supernodes already fully supported the V4 models. More importantly, part of the training for the Flash version was carried out directly on Ascend chips.
That is the part that worries Jensen Huang.
For years, anyone who wanted to do serious AI had to rely on American hardware, because CUDA - NVIDIA’s software ecosystem - was the only mature framework for training at scale. There was no practical alternative, especially for Chinese companies already dealing with U.S. export restrictions.
Now, with V4 and Ascend 950, DeepSeek and Huawei are proving that the alternative exists. And it is already in production.
“But Huawei chips are weaker!” - Sure, but did you actually read the paragraph above?
A single Ascend 950 is less powerful than an NVIDIA B300 GPU. By how much? One B300 in FP4 is worth about 7 to 8 Ascend 950s put together.
So what is the point?
The point is that DeepSeek, thanks to CSA, HCA, and MHC, optimizes every single FLOP from those weaker Ascends so efficiently that it compensates for the hardware disadvantage. The result is that running V4 Flash on Huawei hardware can cost up to 60% less than on an equivalent NVIDIA system.
Sixty percent. Read that again slowly.
For the overwhelming majority of real-world applications - where you do not need the absolute strongest model, just one that is reliable and fast - that equation is already solved.
The virtuous cycle for China, and the vicious one for NVIDIA
After launch, ByteDance, Alibaba, and Tencent rushed to order Huawei Ascend 950 chips in industrial quantities.
The more companies use Ascend, the more the CANN software ecosystem grows, the more developers learn to use it, the more the system improves and attracts new talent. It is a self-reinforcing loop.
For NVIDIA, that potentially means losing a meaningful share of the Chinese market - a market worth billions. Not tomorrow, not overnight. But the direction is clear.
What this means for developers like us
In practical terms, for people like me who work with LLM APIs every day, DeepSeek V4 is already a serious option worth considering:
- Projects with limited budgets: the Flash version is almost free compared with the competition, and it already does excellent work on standard tasks.
- Long-document analysis: the one-million-token context window is not marketing. It really works.
- Self-hosting: the MIT license lets you deploy it on your own servers. Zero dependence on external providers, and no data leaving your own infrastructure.
- Commercial integration: you can use it in your own products without royalties and with minimal constraints.
It is not the most powerful model in absolute terms - on ultra-specialized benchmarks, GPT-5.5 and Claude Opus 4.7 are still ahead. But for 90% of real-world use cases, V4 delivers excellent performance at a fraction of the cost.
Conclusion
DeepSeek V4 is not just a good AI model. It is concrete proof that algorithmic efficiency can offset hardware disadvantage, that NVIDIA’s monopoly over the AI ecosystem is not eternal, and that real competition is good for prices.
After a few days of using it, my wallet agrees with me.
Jensen Huang probably agrees a little less.