If you have ever looked at an artificial intelligence (AI) bill and wondered whether the number was reasonable, the problem is usually not the price. The problem is the metric. Most AI infrastructure gets bought the way utilities get bought, by the unit of supply. The business sells a different unit.
Why is AI cost per token the right metric and the GPU hour the wrong one?
AI cost per token is the right metric. The graphics processing unit hour is the wrong one. A token is the product the business sells. The demand math is users times sessions times tokens per session, multiplied by reasoning overhead and agentic fan-out. Cost per token is the only number that lets a buyer compare two quotes like for like.
A token is a product, not a unit of compute
The default failure mode is treating AI infrastructure as a commodity buy.
Like electricity. Like bandwidth. Like a kilowatt-hour or a megabit per second. Tokens are not electricity. Tokens are products with different values for different jobs.
The business does not sell graphics processing unit (GPU) hours. The business sells tokens, or things made of tokens. A chat reply on a phone. A summarized document on a desk. A line of code in a file. A booked travel itinerary in an email. Each of those is a product the customer can name. Each of those is paid for in tokens behind the scenes.
Token value has two dimensions, not one.
The first dimension is intelligence. The token from a frontier model is worth more than the token from a small one because the frontier model can answer harder questions. Most readers think about intelligence first because intelligence is what the press releases talk about.
The second dimension is interactivity. The token that arrives in two seconds is worth more than the token that arrives in two minutes, even at equal intelligence, because the user is waiting on it. Interactivity is the dimension most people forget.
The two together set what a token is worth, and what price the token can command.
The most expensive token is not always the right one for the job. A small fine-tuned model can beat a giant one on a narrow task, and at a fraction of the cost. The value of a token is relative to the work the user actually needs done.
Treating tokens as a single commodity is like a buyer in a wine store pricing every bottle by the milliliter at a desk in the back room. The math is precise. The pricing is wrong. The wine the buyer wanted is the wine the buyer overpaid for.
The reader who can name where their workload sits on the two-axis grid is the reader whose AI budget is built on the right unit. The rest of the post is what to do with that unit.
The demand math is just math
Now that you know what a token is, the next question is how many of them you need.
This is where most AI capacity decisions go from “guess” to “calculation.”
The base math fits on a napkin. Users multiplied by sessions multiplied by tokens per session. That is the floor.
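The napkin version can be written down in a few lines. This is a minimal sketch; the function name and the input figures (10,000 users, 3 sessions each, 2,000 tokens per session) are hypothetical placeholders, not numbers from the interview.

```python
def floor_tokens_per_day(users: int, sessions_per_user: float,
                         tokens_per_session: float) -> float:
    """Base demand: users x sessions x tokens per session."""
    return users * sessions_per_user * tokens_per_session


# Hypothetical inputs, chosen only to illustrate the arithmetic.
floor = floor_tokens_per_day(users=10_000, sessions_per_user=3,
                             tokens_per_session=2_000)
print(f"{floor:,.0f} tokens per day")
```

Swap in your own three inputs and the result is the floor that everything else builds on.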
The number you actually need is the floor multiplied by a series of factors that the interview behind this post names directly. Reasoning models add invisible thinking tokens. Agentic workflows fan out into many model calls per user prompt. The key-value (KV) cache hit rate reduces effective demand. Time of day and user growth shape the curve.
The math is not the hard part. Sitting down to do it is.
Reasoning models consume tokens the user never sees. The model works through the problem in a kind of private monologue, and every line of that monologue is a token the business pays for. Set a threshold on the number of thinking tokens per interaction. Treat it as a planning lever, not a detail.
Agentic workflows are the surprise factor. A single user prompt like “book a ticket to Miami” can fan out into many model calls under the hood. The primary agent reasons, calls a sub-agent, waits, calls another tool, reasons again. Token demand for one user action can be many times what a chatbot would consume for the same prompt.
The KV cache is the model’s short-term memory of recent context. A high cache hit rate cuts the work the model has to redo. A low hit rate makes every request feel like the first one. The budget moves with the rate.
Time of day matters because users arrive in waves. The peak demand at one in the afternoon on a Tuesday in a U.S. office can be five times the floor. The capacity has to fit the peak, not the average, or the user waits.
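The four factors above can be folded into the floor with a few multiplications. This is a sketch under stated assumptions, not a capacity model: the `DemandFactors` fields and every value are hypothetical, the KV cache is treated as a straight proportional reduction in work (a simplification), and the peak is expressed relative to average load.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass
class DemandFactors:
    """Hypothetical multipliers; every value is a planning assumption."""
    reasoning_overhead: float  # total tokens per visible token (1.0 = no hidden thinking)
    agentic_fan_out: float     # average model calls per user prompt
    kv_cache_hit_rate: float   # share of work avoided by cached context, 0.0 to <1.0
    peak_to_average: float     # peak load divided by average load


def adjusted_daily_tokens(floor_tokens: float, factors: DemandFactors) -> float:
    """Scale the floor by reasoning, fan-out, and the cache discount."""
    if not 0.0 <= factors.kv_cache_hit_rate < 1.0:
        raise ValueError("kv_cache_hit_rate must be in [0.0, 1.0)")
    return (
        floor_tokens
        * factors.reasoning_overhead
        * factors.agentic_fan_out
        * (1.0 - factors.kv_cache_hit_rate)
    )


def peak_tokens_per_second(daily_tokens: float, factors: DemandFactors) -> float:
    """Capacity has to fit the peak, so scale the daily average rate up."""
    return daily_tokens / SECONDS_PER_DAY * factors.peak_to_average


# Illustrative values only: 2x reasoning overhead, 4 calls per prompt,
# 25% cache discount, peak at 5x the average.
factors = DemandFactors(reasoning_overhead=2.0, agentic_fan_out=4.0,
                        kv_cache_hit_rate=0.25, peak_to_average=5.0)
daily = adjusted_daily_tokens(60_000_000, factors)
peak = peak_tokens_per_second(daily, factors)
print(f"{daily:,.0f} tokens per day, about {peak:,.0f} tokens per second at peak")
```

With these made-up inputs, a floor of 60 million tokens a day turns into 360 million, which is the gap between the napkin and the number the capacity plan actually needs.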
Demand for tokens is like a household’s electricity bill. The base math is simple. The factors that change the bill are not obvious. The household that does the math once a year is not surprised at the end of the month. The household that does not is.
The reader who runs the demand math on a kitchen table on a Sunday morning has a number to take into the next infrastructure conversation. The reader who skips the math has a guess.
Cost per GPU hour is the wrong metric
This is the section that turns a reader from someone who can read an AI quote into someone who can evaluate one.
Most AI infrastructure decisions get made on input metrics. Cost per GPU hour. Floating-point operations per dollar. Both numbers are easy to find. Both look precise. Both miss the point.
The business does not sell GPU hours. The business sells tokens.
The metric that combines what you are paying with what you are producing is cost per token. It is the only metric that lets a buyer compare two infrastructure quotes apples to apples.
A team budgeting on cost per GPU hour can buy what looks like the right capacity and still end up short of tokens. The GPU hour says nothing about how productive that hour will be. A new generation of hardware produces many more tokens per hour than the previous generation. The hour costs more. The token costs less. The team that only watched the hour cost missed the win.
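The hour-versus-token gap is easy to see with two made-up quotes. The sketch below assumes a hypothetical older GPU at $2.00 per hour producing 500 tokens per second and a hypothetical newer one at $4.00 per hour producing 3,000 tokens per second. Neither figure describes real hardware; real throughput depends on the model and the workload.

```python
def cost_per_million_tokens(price_per_gpu_hour: float,
                            tokens_per_second: float) -> float:
    """Convert an hourly price and a measured throughput into cost per 1M tokens."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return price_per_gpu_hour / tokens_per_hour * 1_000_000


# Hypothetical quotes: the newer hour costs twice as much.
old_gen = cost_per_million_tokens(price_per_gpu_hour=2.00, tokens_per_second=500)
new_gen = cost_per_million_tokens(price_per_gpu_hour=4.00, tokens_per_second=3_000)

print(f"old generation: ${old_gen:.2f} per million tokens")
print(f"new generation: ${new_gen:.2f} per million tokens")
```

In this example the hour doubles in price while the token drops to roughly a third of its old cost, which is exactly the win that an hour-only budget would miss.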
Cost per GPU hour measures the size of the engine. Cost per token measures what came off the line. The engine size is interesting. The output is the business.
The one sentence to take into any vendor conversation is this: “What is your cost per token on my workload?” Not your cost per hour. Not your operations per dollar. Your cost per token on the workload my business actually runs.
A buyer who walks into a vendor meeting like a buyer at a custom shop with a measured cut list goes home with the right material. A buyer who walks in like a tourist comparing brochures goes home with whatever the brochure pictured.
The vendor who cannot answer the cost-per-token question is selling on input metrics. The vendor who can answer it is selling on output. The buyer who learns to ask the question changes what gets sold.
Price the token, then watch the bill grow anyway
Here is the part that surprises most chief financial officers.
The cost per token is falling. The bill is growing.
Both numbers are true at the same time. They are not in conflict. They are how the math of AI infrastructure works.
The unit price of a token is falling because the hardware is getting better and the model providers are competing for the buyer. Cost per token is the line item that gives the chief executive a good story for the board.
The total bill is growing because demand is outpacing the price cut. The business is using more tokens per user. More users are signing up. Each user is running agentic workflows that consume many tokens per action. Reasoning models are spending more tokens per question. The total bill is price per token multiplied by tokens consumed. The price is shrinking, but the volume is climbing faster.
This is the same shape as a household water bill in a drought year. The unit price of water falls because the city is subsidizing conservation. The total bill rises because the household is watering the garden more aggressively to keep the lawn alive. Both numbers move. The household reads the bill and is surprised.
The fix is to budget both numbers. Track the cost per token to evaluate the vendor. Track the total bill to evaluate the business. Treat the two as separate lines that move on separate clocks.
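Tracking the two lines side by side takes very little code. In this sketch the prices and volumes are invented for illustration: the unit price halves from one year to the next while usage grows fivefold.

```python
def total_bill(price_per_million_tokens: float, million_tokens_used: float) -> float:
    """Total spend is unit price times volume."""
    return price_per_million_tokens * million_tokens_used


# (label, price per 1M tokens in dollars, volume in millions of tokens), all hypothetical.
years = [
    ("year 1", 2.00, 1_000),
    ("year 2", 1.00, 5_000),
]

bills = {label: total_bill(price, volume) for label, price, volume in years}

for label, price, volume in years:
    print(f"{label}: ${price:.2f} per million tokens, total bill ${bills[label]:,.0f}")
```

The unit price line tells you how the vendor is doing. The total line tells you how much the business is using. In this made-up case the first falls by half and the second more than doubles.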
The chief financial officer who reads only the unit price misses the bill. The chief financial officer who reads only the bill misses the win the unit price represents. The reader who reads both is reading the AI economy with the right map.
The bill grows in the year the business succeeds at using AI. The bill stays flat in the year the business is not using AI. A flat AI bill is not a sign of cost discipline. It is a sign of underuse.
The token is the product. The bill is the business. Both numbers move on different clocks. The buyer who can read the unit price as the vendor’s win and the total bill as the company’s use has a framework the chief financial officer can defend.
The bill that grows in the year of growth is the right bill. The bill that stays flat in the year of growth is the wrong one.
The argument draws on Shrudi Kopakar’s interview with Noah Kravitz on a 2025 AI Podcast about accelerated computing.