If you have followed Artificial Intelligence (AI) hardware announcements and wondered why the generational gaps look so large, this is the place to start. The newest accelerator costs about twice as much per hour as its predecessor and delivers twice the floating-point operations per dollar. On the same workload, it delivers fifty times more tokens per watt.
Why does Blackwell deliver fifty times the performance when the spec sheet says two?
AI chip performance gains compound through extreme co-design. Blackwell delivers fifty times the tokens per watt of Hopper, not the two times a spec sheet implies, because the platform is optimized against the model, the network, the rack, the cooling, and the workload at the same time.
On the spec sheet, 2x. In the real world, 50x
Pull up the comparison between the current AI accelerator and the one before it. On the spec sheet, the new chip costs about twice as much per hour and delivers twice the floating-point operations per dollar. A two-times improvement. Reasonable.
Now look at the output side. The new chip delivers fifty times more tokens per watt. It produces tokens at thirty-five times lower cost.
The twenty-five-fold gap between the spec-sheet view (two times) and the output view (fifty times) is what this post is about. It is also the reason “compare two chips” is the wrong question for AI hardware in 2026.
This post is for the technical leader, architect, or curious practitioner who has read the generational announcements and felt the math did not add up. The math adds up. It is just being done on the wrong unit.
Comparing two AI chips by their spec sheets is like a buyer choosing a car by the size of the engine alone, ignoring the road and the driver and the route. The engine matters. The engine is not the whole story. The road decides what the engine can deliver.
The spec-sheet number is the input: what the chip itself is rated at. The tokens-per-watt number is the output: what the customer pays for. Both are real. This post is about the gap between them and where that gap comes from.
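The input-versus-output distinction can be made concrete with a few lines of arithmetic. The sketch below is illustrative only. It assumes cost per token is simply the hourly price divided by the tokens produced per hour, and the baseline price and throughput are hypothetical placeholders rather than vendor figures. Only the two ratios (twice the hourly price, thirty-five times lower cost per token) come from the text above.

```python
# Illustrative arithmetic only. The baseline price and throughput are
# hypothetical placeholders; only the two ratios come from the post.

def cost_per_token(price_per_hour: float, tokens_per_hour: float) -> float:
    """Simplified model: dollars per token = hourly price / hourly output."""
    return price_per_hour / tokens_per_hour


def required_throughput_ratio(price_ratio: float, cost_reduction: float) -> float:
    """Throughput multiple needed for a chip that costs `price_ratio` times
    more per hour to deliver tokens at `cost_reduction` times lower cost."""
    return price_ratio * cost_reduction


old_price, old_tokens = 1.0, 1_000_000.0  # hypothetical baseline
ratio = required_throughput_ratio(price_ratio=2.0, cost_reduction=35.0)
new_price, new_tokens = old_price * 2.0, old_tokens * ratio

old_cpt = cost_per_token(old_price, old_tokens)
new_cpt = cost_per_token(new_price, new_tokens)
print(f"throughput multiple needed: {ratio:.0f}x")
print(f"cost per token falls by:    {old_cpt / new_cpt:.0f}x")
```

Under this simplified model, a chip billed at twice the hourly price has to produce about seventy times the tokens per hour to land at thirty-five times lower cost per token. The hourly price alone says very little about what the customer ends up paying per token.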
On the spec sheet, two times. In the real world, fifty times.
Co-design is not integration
The first move toward explaining the fifty is to nail one piece of vocabulary.
Integration and co-design sound like synonyms. They are not.
Integration combines parts that were already independently designed. The parts may work well together. Each was designed for its own goals first. Co-design starts from the outcome and designs every part backward from there. The parts know about each other from the first sketch.
The difference sounds like semantics. It is not. It is the difference between a two-times improvement and a fifty-times improvement.
In modern AI hardware the outcome is the lowest cost per token. The leading chip vendor calls its version of co-design “extreme” because it spans seven layers. Compute. Memory. Storage. Networking. Software runtimes. Serving software. The broader ecosystem of partners and open-source frameworks.
None of those layers alone delivers the fifty. All of them together do.
The modern AI hardware product is not a chip. It is a platform. The Vera Rubin platform has seven chips inside it. The chip is no longer the unit of design. The system is.
Co-design is like a chef designing the kitchen, the menu, the supplier list, and the dishwasher schedule together for one specific style of cooking. Integration is like a homeowner buying the best stove on the market, the best oven on the market, and the best sink on the market, then arranging them in a room. Both kitchens cook. Only the co-designed one runs at peak under load.
The moat is not in any single layer. Anyone can build one good layer. Building all seven, with each layer designed knowing the others, is the harder problem and the bigger gap.
That is why the comparison “two chips on a spec sheet” is the wrong comparison. The thing being compared is no longer the right unit.
A case study: how mixture-of-experts models get their speed
Vocabulary by itself is not convincing. The way to see co-design is to look at a specific case where it has been done.
Mixture-of-experts models are the clearest case study.
The model architecture sends different inputs to different specialist sub-networks. Different prompts trigger different experts. The architecture is efficient at inference because only a fraction of the model fires for any one input.
The cost of that efficiency is heavy traffic between graphics processing units (GPUs). Each token’s data has to travel to the GPU that hosts the right expert, the expert has to compute, and the result has to travel back.
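A toy simulation makes the traffic pattern visible. This is a minimal sketch, not any vendor's routing code: the expert count, the experts-per-GPU placement, and the top-2 routing rule are hypothetical choices, and random scores stand in for a learned gating network.

```python
import random

# Hypothetical configuration, chosen only for illustration.
NUM_EXPERTS = 8       # specialist sub-networks in one layer
EXPERTS_PER_GPU = 2   # experts 0-1 live on GPU 0, experts 2-3 on GPU 1, ...
TOP_K = 2             # experts activated per token


def gpu_of_expert(expert: int) -> int:
    """Map an expert index to the GPU assumed to host it."""
    return expert // EXPERTS_PER_GPU


def route(scores: list[float], k: int = TOP_K) -> list[int]:
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    return ranked[:k]


def count_dispatches(home_gpus: list[int], all_scores: list[list[float]]) -> tuple[int, int]:
    """Count (token, expert) dispatches that stay on the token's home GPU
    versus those that must cross to another GPU and come back."""
    local = remote = 0
    for home, scores in zip(home_gpus, all_scores):
        for expert in route(scores):
            if gpu_of_expert(expert) == home:
                local += 1
            else:
                remote += 1
    return local, remote


rng = random.Random(0)  # fixed seed so the run is repeatable
num_gpus = NUM_EXPERTS // EXPERTS_PER_GPU
homes = [rng.randrange(num_gpus) for _ in range(1000)]
scores = [[rng.random() for _ in range(NUM_EXPERTS)] for _ in range(1000)]

local, remote = count_dispatches(homes, scores)
print(f"active experts per token: {TOP_K} of {NUM_EXPERTS}")
print(f"dispatches staying local: {local}, crossing GPUs: {remote}")
```

With experts spread evenly over four GPUs and no locality in the routing, most dispatches in this toy setup leave the token's home GPU. Only a fraction of the model fires per token, but the interconnect is busy on almost every one.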
Without co-design, that traffic becomes the bottleneck. The chip waits on the network like a worker at a desk waiting on a phone call that is slow to come. Output drops. The customer feels it in the bill. The bill is the only number the chief financial officer reads at the end of the month, and the bill is the number the co-design was tuned to lower.
With co-design, the rack-scale hardware is built around the traffic pattern. The networking cable, the memory layout on each card, the routing software, and the serving stack all know in advance that mixture-of-experts traffic will dominate the workload. Each layer is tuned for it. The room of racks behaves like a single machine instead of a hundred separate ones.
The result is the chip running at output capacity instead of waiting on the data path. The fifty-times number is what happens when the seven layers are tuned together for the workload that pays the bill.
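The difference between a chip that waits and a chip that runs at capacity can be sketched with a two-term timing model. This is an idealized illustration under stated assumptions: the millisecond figures are hypothetical, and perfect overlap of communication with computation is an upper bound that real systems only approach.

```python
# Idealized timing model. All millisecond values are hypothetical.

def step_time_serial(compute_ms: float, comm_ms: float) -> float:
    """The chip computes, then sits idle while data moves."""
    return compute_ms + comm_ms


def step_time_overlapped(compute_ms: float, comm_ms: float) -> float:
    """Communication is hidden behind computation (best case)."""
    return max(compute_ms, comm_ms)


def utilization(compute_ms: float, step_ms: float) -> float:
    """Fraction of each step the chip spends doing useful math."""
    return compute_ms / step_ms


COMPUTE_MS = 4.0
scenarios = {
    "slow fabric, no overlap": step_time_serial(COMPUTE_MS, 6.0),
    "slow fabric, overlapped": step_time_overlapped(COMPUTE_MS, 6.0),
    "fast fabric, overlapped": step_time_overlapped(COMPUTE_MS, 3.0),
}
for name, step_ms in scenarios.items():
    busy = utilization(COMPUTE_MS, step_ms)
    print(f"{name}: step {step_ms:.1f} ms, chip busy {busy:.0%}")
```

In this model neither fix is enough alone. Overlap helps, but the chip still waits while the fabric is slower than the math. Only when the interconnect and the software that schedules the overlap are tuned together does the data path stop being the limit.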
This is the case study because mixture-of-experts is also the architecture most large frontier models use. The workload is not exotic. The workload is the workload.
A reader who has watched a sports team get faster after the coach changed the practice schedule knows the pattern from the inside. The athletes are the same. The practice changed. The team is faster because the practice was tuned for the game it plays.
The chips were faster on paper. The platform is faster in the game.
Each optimization is a drop. All of them are the river
The fifty does not come from one big trick. It comes from many small ones, each contributing a few percent, all compounding.
A faster matrix multiplication primitive. A smarter memory layout for the activations. A networking protocol that overlaps with computation instead of pausing for it. A serving runtime that batches requests intelligently for the model architecture in use. An open-source kernel from a partner that handles one specific operation a few percent better. A compiler pass that fuses operations the old compiler kept separate.
Each optimization is a drop. Each drop is real. None of the drops alone would justify a fifty-times claim. All of them together do.
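The compounding is easy to check. The sketch below assumes, for simplicity, that every optimization contributes the same gain and that the gains multiply independently. Real optimizations are uneven and sometimes overlap, so treat the numbers as a feel for the arithmetic rather than a measurement.

```python
import math


def compounded_gain(per_step_gain: float, steps: int) -> float:
    """Total speedup when each step multiplies throughput by (1 + gain)."""
    return (1.0 + per_step_gain) ** steps


def additive_gain(per_step_gain: float, steps: int) -> float:
    """Total speedup if the same gains merely added up instead."""
    return 1.0 + per_step_gain * steps


def steps_needed(per_step_gain: float, target: float) -> int:
    """Smallest number of equal steps whose compounded gain reaches target."""
    return math.ceil(math.log(target) / math.log(1.0 + per_step_gain))


for gain in (0.02, 0.04, 0.08):
    n = steps_needed(gain, 50.0)
    print(f"{gain:.0%} each: {n} optimizations compound to "
          f"{compounded_gain(gain, n):.1f}x (added up: {additive_gain(gain, n):.1f}x)")
```

At four percent each, about a hundred optimizations compound to roughly fifty times, while the same hundred gains added together would reach only five times. The size of the final number comes from the multiplication, which is why the drops only matter when they stack.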
A hundred small optimizations are like a hundred small leaks that empty the same bathtub. No one leak is the leak. The bathtub is empty in an hour because every leak is dripping at once.
The optimizations stack because the seven layers were designed knowing about each other. The matrix multiplication primitive knows what memory layout to expect. The memory layout knows what the networking protocol will deliver. The networking protocol knows what the serving runtime expects. Each handoff is clean because both sides were designed together.
In a non-co-designed system, every handoff is friction. The matrix multiplication primitive does its work and hands the result to a memory layout that did not expect that shape. The memory layout adapts. The networking protocol receives the adapted shape and adapts again. By the time the serving runtime sees the data, the data has been adapted at every handoff. Each adaptation costs cycles. Cycles are tokens.
The household that has tried to throw a dinner party with eight family members in two kitchens knows the rule from the inside. The food is fine in either kitchen. The party gets dinner on the table at the right time only when each kitchen knows what the other is doing. The seven layers of an AI platform are the same problem at a different scale.
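The handoff friction can be modeled as a count of layout mismatches along a pipeline. This is a deliberately small sketch: the stage names follow the paragraph above, while the layout labels (`row_major`, `tiled`, `packed`, `batched`) are hypothetical stand-ins for whatever data format each layer prefers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    expects: str   # layout this stage wants on its input
    produces: str  # layout this stage emits


def count_conversions(pipeline: list[Stage]) -> int:
    """Count handoffs where the upstream output layout differs from what
    the downstream stage expects, forcing a conversion."""
    return sum(
        1
        for upstream, downstream in zip(pipeline, pipeline[1:])
        if upstream.produces != downstream.expects
    )


# Each part designed for its own goals first (hypothetical layouts).
integrated = [
    Stage("matmul primitive", expects="tiled", produces="tiled"),
    Stage("memory layout", expects="row_major", produces="row_major"),
    Stage("network protocol", expects="packed", produces="packed"),
    Stage("serving runtime", expects="batched", produces="batched"),
]

# Every part designed knowing what its neighbors produce and expect.
co_designed = [
    Stage("matmul primitive", expects="tiled", produces="tiled"),
    Stage("memory layout", expects="tiled", produces="tiled"),
    Stage("network protocol", expects="tiled", produces="tiled"),
    Stage("serving runtime", expects="tiled", produces="tiled"),
]

print("integrated conversions: ", count_conversions(integrated))
print("co-designed conversions:", count_conversions(co_designed))
```

The model ignores how expensive each conversion is, which varies widely in practice. It shows only the structural point: the conversions disappear when both sides of each handoff agree in advance.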
The fifty-times number is real. It is real because it adds up from the drops, and the drops only stack when the layers were designed knowing about each other.
The spec sheet is the input. The tokens-per-watt number is the output. The gap is the co-design.
The chip is no longer the unit of design. The platform is.
The co-design is the moat that takes ten years of cross-team work to build and a single quarter of pretending to have built it to lose. The reader who can name the seven layers reads every chip announcement with the right ruler.
The argument draws on Shrudi Kopakar’s interview with Noah Kravitz in a 2025 AI Podcast episode about accelerated computing.