An essay on AI
AI Chip Performance: Why Blackwell Delivers 50x, Not 2x
AI hardware delivers 50x more tokens per watt while spec sheets show 2x. The 25x gap lives in extreme co-design across seven layers, not in any single chip.

If you have followed Artificial Intelligence (AI) hardware announcements and wondered why the generational gaps look so large, this is the place to start. The new chip costs about twice as much per hour and delivers twice the operations per dollar. On the same workload, it ships fifty times more tokens per watt.

Short answer

Why does Blackwell deliver fifty times the performance when the spec sheet says two?

AI chip performance gains compound through extreme co-design. Blackwell delivers fifty times the tokens per watt of Hopper, not the two times a spec sheet implies. The platform optimizes against the model, the network, the rack, the cooling, and the workload at the same time.

On the spec sheet, 2x. In the real world, 50x

Pull up the comparison between the current AI accelerator and the one before it. On the spec sheet, the new chip costs about twice as much per hour and delivers twice the floating-point operations per dollar. A two-times improvement. Reasonable.

Now look at the output side. The new chip delivers fifty times more tokens per watt. It produces tokens at thirty-five times lower cost.

Chart contrasting the spec-sheet view of the new AI chip (2x cost per GPU hour, 2x floating-point operations per dollar) with the real-world output view (50x tokens per watt, 35x lower cost per token)
Source: Spec sheet on the left. Output on the right. The 25x gap is the whole story.

The twenty-five times gap between the spec-sheet view and the output view is what this post is about. It is also the reason “compare two chips” is the wrong question for AI hardware in 2026.

The reader is a technical leader, an architect, or a curious practitioner who has read the generational announcements and felt the math did not add up. The math adds up. The math is just being done on the wrong unit.

Comparing two AI chips by their spec sheets is like a buyer choosing a car by the size of the engine alone, ignoring the road and the driver and the route. The engine matters. The engine is not the whole story. The road decides what the engine can deliver.

The spec-sheet number is the input. The tokens-per-watt number is the output. The output is what the customer pays for. The input is what the chip itself was billed at. Both are real. The post is about the gap between them and where that gap comes from.
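
The relationship between the two views can be put in a few lines of arithmetic. The sketch below is an illustration, not the vendor's accounting. It assumes cost per token is simply hourly cost divided by tokens produced per hour, and it normalizes the older chip to 1.0 on both axes (hypothetical units, not real prices). The 2x hourly cost and the 35x lower cost per token are the figures quoted above. The implied throughput ratio is derived from them under that one assumption.

```python
def cost_per_token(cost_per_hour: float, tokens_per_hour: float) -> float:
    """Output metric: what the customer pays for each token."""
    return cost_per_hour / tokens_per_hour


# Older chip, normalized to 1.0 (hypothetical units, not real prices).
old_cost_per_hour = 1.0
old_tokens_per_hour = 1.0

# Ratios quoted in the essay.
hourly_cost_ratio = 2.0          # new chip costs about 2x per hour
cost_per_token_reduction = 35.0  # tokens cost 35x less to produce

# Assumption: cost per token = hourly cost / tokens per hour.
# If the hourly price doubles while cost per token falls 35x, tokens per
# hour must have risen by the product of the two ratios.
implied_throughput_ratio = hourly_cost_ratio * cost_per_token_reduction

new_cost_per_hour = old_cost_per_hour * hourly_cost_ratio
new_tokens_per_hour = old_tokens_per_hour * implied_throughput_ratio

old_cpt = cost_per_token(old_cost_per_hour, old_tokens_per_hour)
new_cpt = cost_per_token(new_cost_per_hour, new_tokens_per_hour)

print(f"hourly cost ratio:        {new_cost_per_hour / old_cost_per_hour:.0f}x")
print(f"implied throughput ratio: {implied_throughput_ratio:.0f}x")
print(f"cost per token ratio:     {old_cpt / new_cpt:.0f}x lower")
```

The hourly price moves the wrong way and the bill still shrinks, because the denominator grew much faster than the numerator.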

On the spec sheet, two times. In the real world, fifty times.

Co-design is not integration

The first move toward explaining the fifty is to nail one piece of vocabulary.

Integration and co-design sound like synonyms. They are not.

Integration combines parts that were already independently designed. The parts may work well together. Each was designed for its own goals first. Co-design starts from the outcome and designs every part backward from there. The parts know about each other from the first sketch.

The difference sounds like semantics. It is not. The difference is the difference between a two-times improvement and a fifty-times improvement.

Side-by-side comparison of integration (parts independently designed then combined) and co-design (parts designed together from the start toward one outcome) across seven layers: compute, memory, storage, networking, runtimes, serving software, and ecosystem
Source: Integration on the left. Co-design on the right. Seven layers designed together is the gap most chip comparisons miss.

In modern AI hardware the outcome is the lowest cost per token. The leading chip vendor calls its version of co-design “extreme” because it spans seven layers. Compute. Memory. Storage. Networking. Software runtimes. Serving software. The broader ecosystem of partners and open-source frameworks.

None of those layers alone delivers the fifty. All of them together do.

The modern AI hardware product is not a chip. It is a platform. The Vera Rubin platform has seven chips inside it. The chip is no longer the unit of design. The system is.

Co-design is like a chef designing the kitchen, the menu, the supplier list, and the dishwasher schedule together for one specific style of cooking. Integration is like a homeowner buying the best stove on the market, the best oven on the market, and the best sink on the market, then arranging them in a room. Both kitchens cook. Only the co-designed one runs at peak under load.

The moat is not in any single layer. Anyone can build one good layer. Building all seven, with each layer designed knowing the others, is the harder problem and the bigger gap.

That is why the comparison “two chips on a spec sheet” is the wrong comparison. The thing being compared is no longer the right unit.

A case study: how mixture-of-experts models get their speed

Vocabulary by itself is not convincing. The way to see co-design is to look at a specific case where it has been done.

Mixture-of-experts models are the clearest case study.

The model architecture sends different inputs to different specialist sub-networks. Different prompts trigger different experts. The architecture is efficient at inference because only a fraction of the model fires for any one input.

The cost of that efficiency is heavy traffic between graphics processing units. Each token’s input has to travel to the right expert, the expert has to compute, and the result has to travel back.
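
A toy version of that routing makes the traffic visible. The sketch below is a minimal illustration under stated assumptions, not any production router. The scores, the `top_k` value, and the placement of tokens and experts on two devices are all hypothetical, and real systems use learned gating and batched all-to-all transfers rather than Python loops.

```python
def route_tokens(scores, top_k=1):
    """For each token, pick the indices of the top_k highest-scoring experts."""
    routes = []
    for token_scores in scores:
        ranked = sorted(
            range(len(token_scores)),
            key=lambda expert: token_scores[expert],
            reverse=True,
        )
        routes.append(ranked[:top_k])
    return routes


def count_cross_device_hops(routes, token_device, expert_device):
    """A token sent to an expert on another device costs two hops: out and back."""
    hops = 0
    for token, experts in enumerate(routes):
        for expert in experts:
            if expert_device[expert] != token_device[token]:
                hops += 2
    return hops


# Hypothetical gating scores: 4 tokens (rows) x 4 experts (columns).
scores = [
    [0.1, 0.7, 0.1, 0.1],
    [0.6, 0.2, 0.1, 0.1],
    [0.1, 0.1, 0.2, 0.6],
    [0.2, 0.1, 0.6, 0.1],
]
token_device = [0, 0, 1, 0]   # device where each token currently lives
expert_device = [0, 0, 1, 1]  # device where each expert's weights live

routes = route_tokens(scores, top_k=1)
hops = count_cross_device_hops(routes, token_device, expert_device)
active_fraction = 1 / len(expert_device)  # top_k=1 of 4 experts fires per token

print("routes:", routes)
print("cross-device hops:", hops)
print("fraction of experts active per token:", active_fraction)
```

Only a quarter of the experts fire for each token in this example, yet one of the four tokens still has to cross devices and come back. At production scale that round trip is the traffic the rest of the platform has to absorb.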

Without co-design, that traffic becomes the bottleneck. The chip waits on the network like a worker at a desk waiting on a phone call that never comes. The output drops. The customer feels it in the bill. The bill is the only number the chief financial officer reads at the end of the month, and the bill is the number the co-design was tuned to lower.

With co-design, the rack-scale hardware is built around the traffic pattern. The networking cable, the memory layout on each card, the routing software, and the serving stack all know in advance that mixture-of-experts traffic will dominate the workload. Each layer is tuned for it. The room of racks behaves like a single machine instead of a hundred separate ones.

The result is the chip running at output capacity instead of waiting on the data path. The fifty-times number is what happens when the seven layers are tuned together for the workload that pays the bill.
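
The effect of a chip waiting on the data path can be sketched with a toy utilization model. Everything here is an assumption made for illustration. The millisecond figures are invented, each step is treated as one block of compute plus one block of communication, and overlap is assumed to be perfect when it is enabled. None of these numbers come from the source.

```python
def step_time_serial(compute_ms, comm_ms):
    """Compute pauses while data moves: the two costs add."""
    return compute_ms + comm_ms


def step_time_overlapped(compute_ms, comm_ms):
    """Data moves while compute runs: the slower of the two sets the pace."""
    return max(compute_ms, comm_ms)


def utilization(compute_ms, step_ms):
    """Fraction of each step the chip spends doing useful math."""
    return compute_ms / step_ms


# Hypothetical per-step timings.
compute_ms = 4.0
slow_comm_ms = 6.0  # fabric not sized for expert traffic
fast_comm_ms = 3.0  # fabric sized for expert traffic

baseline = utilization(compute_ms, step_time_serial(compute_ms, slow_comm_ms))
overlap_only = utilization(compute_ms, step_time_overlapped(compute_ms, slow_comm_ms))
fabric_only = utilization(compute_ms, step_time_serial(compute_ms, fast_comm_ms))
both = utilization(compute_ms, step_time_overlapped(compute_ms, fast_comm_ms))

for name, value in [
    ("baseline", baseline),
    ("overlap only", overlap_only),
    ("faster fabric only", fabric_only),
    ("both together", both),
]:
    print(f"{name:>20}: {value:.0%} utilization")
```

In this toy model neither change reaches full utilization alone. Overlap without a faster fabric still leaves the chip waiting, and a faster fabric without overlap still pauses compute. Only the two together keep the chip busy for the whole step.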

This is the case study because mixture-of-experts is also the architecture most large frontier models use. The workload is not exotic. The workload is the workload.

A reader who has watched a sports team get faster after the coach changed the practice schedule knows the pattern from the inside. The athletes are the same. The practice changed. The team is faster because the practice was tuned for the game it plays.

The chips were faster on paper. The platform is faster in the game.

Each optimization is a drop. All of them are the river

The fifty does not come from one big trick. It comes from dozens of small ones, each contributing a few percent, all compounding.

A faster matrix multiplication primitive. A smarter memory layout for the activations. A networking protocol that overlaps with computation instead of pausing for it. A serving runtime that batches requests intelligently for the model architecture in use. An open-source kernel from a partner that handles one specific operation a few percent better. A compiler pass that fuses operations the old compiler kept separate.

Each optimization is a drop. Each drop is real. None of the drops alone would justify a fifty-times claim. All of them together do.
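
The compounding can be checked with simple arithmetic. The sketch below assumes, purely for illustration, that every optimization yields the same 3% gain and that gains multiply independently. Neither assumption comes from the source, and real optimizations vary in size and interact with each other. It compares multiplying the gains with merely adding them.

```python
import math


def compounded_gain(per_step_gain, steps):
    """Total speedup when each step multiplies the result of the previous ones."""
    return (1.0 + per_step_gain) ** steps


def additive_gain(per_step_gain, steps):
    """Total speedup if the same steps only added up instead of compounding."""
    return 1.0 + per_step_gain * steps


def steps_needed(target, per_step_gain):
    """Smallest number of equal compounding steps that reaches the target speedup."""
    return math.ceil(math.log(target) / math.log(1.0 + per_step_gain))


target_gap = 25.0     # the gap between the 2x view and the 50x view
per_step_gain = 0.03  # hypothetical: each optimization is worth 3%

n = steps_needed(target_gap, per_step_gain)

print(f"steps needed at {per_step_gain:.0%} each: {n}")
print(f"compounded gain after {n} steps: {compounded_gain(per_step_gain, n):.1f}x")
print(f"additive gain after {n} steps:   {additive_gain(per_step_gain, n):.2f}x")
```

Under these assumptions it takes on the order of a hundred 3% improvements to close a 25x gap, and the same improvements would amount to only a little over 4x if they merely added up. The multiplication is what the stacking buys.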

A hundred small optimizations are like a hundred small leaks that empty the same bathtub. No one leak is the leak. The bathtub is empty in an hour because every leak is dripping at once.

The optimizations stack because the seven layers were designed knowing about each other. The matrix multiplication primitive knows what memory layout to expect. The memory layout knows what the networking protocol will deliver. The networking protocol knows what the serving runtime expects. Each handoff is clean because both sides were designed together.

In a non-co-designed system, every handoff is friction. The matrix multiplication primitive does its work and hands the result to a memory layout that did not expect that shape. The memory layout adapts. The networking protocol receives the adapted shape and adapts again. By the time the serving runtime sees the data, the data has been adapted at every handoff along the way. Each adaptation costs cycles. Cycles are tokens.

The household that has tried to throw a dinner party with eight family members in two kitchens knows the rule from the inside. The food is fine in either kitchen. The party gets dinner on the table at the right time only when the two kitchens know what the other is doing. The seven layers of an AI platform are the same problem at a different scale.

The fifty-times number is real. It is real because it adds up from the drops, and the drops only stack when the layers were designed knowing about each other.

The chip is no longer the unit of design. The platform is. The spec sheet is the input. The tokens-per-watt number is the output.

The gap between them is the co-design, and the co-design is a moat that takes ten years of cross-team work to build and can be lost in a single quarter of pretending it is already built. The reader who can name the seven layers reads every chip announcement with the right ruler.

Source

The argument draws on Shrudi Kopakar’s interview with Noah Kravitz on a 2025 AI Podcast about accelerated computing.

Questions readers ask

Six questions on this essay.

01 Why is the new AI chip 50 times faster than the one before it?

The chip itself is not fifty times faster. The chip is about two times better on input metrics like cost per hour and floating-point operations per dollar. The fifty-times number comes from output metrics like tokens per watt and cost per token. The gap between the spec-sheet view and the output view is twenty-five times. That gap lives in extreme co-design across seven layers. Compute, memory, storage, networking, software runtimes, serving software, and the partner ecosystem are all tuned together for one outcome: the lowest cost per token. No single layer delivers fifty times. All of them together do. The right way to read an AI hardware announcement is to ask what the input metric is and what the output metric is. The spec sheet shows the input. The customer pays for the output.

02 What is the difference between integration and co-design?

Integration combines parts that were already designed independently. The parts may work well together. Each was designed for its own goals first. Co-design starts from the outcome and designs every part backward from there. The parts know about each other from the first sketch. The difference sounds like semantics. The difference is the gap between a two-times improvement and a fifty-times improvement. Integration is buying the best stove, the best oven, and the best sink, then arranging them in a kitchen. Co-design is designing the kitchen, the menu, the supplier list, and the dishwasher schedule together for one specific style of cooking. Both kitchens cook. Only the co-designed kitchen runs at peak under load. The same pattern applies to AI hardware. The chip is fine in isolation. The platform is fast when seven layers were designed together.

03 What is extreme co-design?

Extreme co-design is the version of co-design that spans an unusually wide set of layers. Most companies that talk about co-design mean two or three layers, typically a chip and the software that runs on it. NVIDIA uses the word extreme because the co-design spans seven layers. Compute is the silicon. Memory is the chip-adjacent storage. Storage is the rack-scale storage tier. Networking is the chip-to-chip and rack-to-rack data path. Software runtimes are the libraries that move data between hardware components. Serving software is the layer that batches and routes customer requests. The ecosystem is the partner network of silicon partners, original equipment manufacturers, cloud providers, and open-source frameworks. The breadth is the moat. Building one good layer is easy. Building seven, all designed knowing the others, is the harder problem and the bigger gap.

04 Why is cost per GPU hour the wrong metric?

Because the customer does not pay for graphics processing unit hours. The customer pays for tokens. A chip that costs twice as much per hour and produces tokens at thirty-five times lower cost is a far better deal than the hourly price suggests. Cost per hour is an input metric. Cost per token is an output metric. The customer's bill is the output metric. The two numbers move in different directions on a generational upgrade: cost per hour goes up while cost per token falls, because the co-design moves the output metric far further than the input metric. Reading the input metric and ignoring the output metric is the most common mistake in AI hardware analysis right now. The fix is to look at both numbers and pay attention to the gap. The gap is where the co-design lives.

05 Why do AI hardware platforms have so many chips?

Because the model architecture demands it. A modern frontier model is too large to fit on a single chip. A mixture-of-experts model sends different inputs to different specialist sub-networks. Each sub-network may live on a different chip. The traffic between chips becomes the bottleneck if the platform was not designed for it. The Vera Rubin platform has seven chips inside it because the platform is the unit of design now. The chip is no longer the unit. Comparing the new platform to the old platform on a per-chip basis is the wrong comparison. The right comparison is platform to platform on the workload that runs in production. The same workload on the new platform delivers fifty times more tokens per watt. That is the comparison that matters and the comparison the spec sheet does not show.

06 How does mixture-of-experts work and why does it benefit from co-design?

Mixture-of-experts is a model architecture that splits a large model into many specialist sub-networks. Different inputs activate different specialists. Only a fraction of the model fires on any one input, which makes the architecture efficient at inference time. The cost of that efficiency is heavy traffic between graphics processing units. Each token has to travel to the right specialist, the specialist has to compute, and the result has to travel back. Without co-design, the traffic becomes the bottleneck. The chip sits waiting on the network and the output drops. With co-design, the networking, memory layout, routing software, and serving stack all know in advance that mixture-of-experts traffic will dominate the workload. Each layer is tuned for the traffic pattern. The chips run at output capacity instead of waiting on the data path.

About the author
Hanh D. Brown, writer.

Essayist writing on craft, voice, aging, and what gets harder to say with the years. Twenty years building AI systems for life-stage decisions. Now writing the publication that has the time to ask why.
