An essay on AI

AI Cost Per Token: Why GPU Hour Is the Wrong Bill

AI bills come in confusing units. The fix is to budget for the product the business sells: tokens. Cost per token is the only metric that compares two quotes.

Hanh D. Brown · 11 min read

Hanh D. Brown May 28, 2026

In this essay

01 A token is a product, not a unit of compute
02 The demand math is just math
03 Cost per GPU hour is the wrong metric
04 Price the token, then watch the bill grow anyway

If you have ever looked at an Artificial Intelligence (AI) bill and wondered whether the number was reasonable, the problem is usually not the price. The problem is the metric. Most AI infrastructure gets bought the way utilities get bought, by the unit of supply. The business sells a different unit.

Short answer

Why is AI cost per token the right metric and the GPU hour the wrong one?

AI cost per token is the right metric. The graphics processing unit hour is the wrong bill. A token is the product the business sells. The demand math is users times sessions times tokens per session, multiplied by reasoning overhead and agentic fan-out. Cost per token is the only number that compares two quotes.

A token is a product, not a unit of compute

A finance team opening its first AI invoice tends to file it next to the electricity and the bandwidth, one more metered commodity.

That instinct prices every unit the same, the way a utility meters a kilowatt-hour or a megabit. Tokens do not work that way. Tokens are products with different values for different jobs.

What the business does not sell is graphics processing unit (GPU) hours. The business sells tokens, or things made of tokens. A chat reply on a phone. A summarized document on a desk. A line of code in a file. A booked travel itinerary in an email. Each of those is a product the customer can name. Each of those is paid for in tokens behind the scenes.

Token value has two dimensions, not one.

Figure: a two-axis grid of token value, with intelligence on one axis and interactivity (speed of arrival) on the other, showing where different models and use cases land. The value of a token is its position on the grid, not the line item on the invoice.

The first dimension is intelligence. A token from a frontier model is worth more than a token from a small one because the frontier model can answer harder questions, the capability the press releases talk about. Most readers think about intelligence first for exactly that reason.

Interactivity is the second dimension, and it is the one most people forget. A token that arrives in two seconds is worth more than the same token arriving in two minutes, even at equal intelligence, because the user is waiting on it.

Together the two axes set what a token is worth, and what price the token can command.

Notice that the most expensive token is not always the right one for the job. A small fine-tuned model can beat a giant one on a narrow task, and at a fraction of the cost. The value of a token is relative to the work the user actually needs done.

Treating tokens as a single commodity is like a buyer in a wine store pricing every bottle by the milliliter at a desk in the back room. The math is precise. The pricing is wrong. The wine the buyer wanted is the wine the buyer overpaid for.

Name where your workload sits on the two-axis grid and your AI budget is built on the right unit.

The demand math is just math

With the unit settled, the next question is how many tokens you need.

Base math fits on a napkin. Users multiplied by sessions multiplied by tokens per session. That is the floor.

Real demand is that floor multiplied by a series of factors the source names directly. Reasoning models add invisible thinking tokens. Agentic workflows fan out into many model calls per user prompt. The key-value (KV) cache hit rate reduces effective demand. Time of day and user growth shape the curve.
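The napkin math can be written down as a minimal sketch. The function names, the multiplier values, and the choice to treat the cache hit rate as a simple discount on demand are assumptions made for illustration, not figures from the source.

```python
def base_token_demand(users: int, sessions_per_user: float, tokens_per_session: float) -> float:
    """The floor: users x sessions x tokens per session."""
    return users * sessions_per_user * tokens_per_session


def effective_token_demand(
    base_tokens: float,
    reasoning_overhead: float = 1.0,  # assumed multiplier for hidden thinking tokens
    agentic_fan_out: float = 1.0,     # assumed multiplier for model calls per user prompt
    cache_hit_rate: float = 0.0,      # assumed share of work the KV cache avoids (0 to 1)
) -> float:
    """Scale the floor by the workload factors named in the essay."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    return base_tokens * reasoning_overhead * agentic_fan_out * (1.0 - cache_hit_rate)


# Hypothetical workload: 10,000 users, 3 sessions each per day, 2,000 tokens per session.
floor = base_token_demand(users=10_000, sessions_per_user=3, tokens_per_session=2_000)
demand = effective_token_demand(
    floor, reasoning_overhead=2.0, agentic_fan_out=5.0, cache_hit_rate=0.5
)

print(f"floor:  {floor:,.0f} tokens per day")
print(f"demand: {demand:,.0f} tokens per day")
```

With these made-up inputs the floor is 60 million tokens a day and the adjusted demand is 300 million, five times the napkin number. The point is the shape of the calculation, not the specific values.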

None of that math is the hard part. Sitting down to do it is.

Reasoning models consume tokens the user never sees. The model works through the problem in a kind of private monologue, and every line of that monologue is a token the business pays for. Set a threshold on the number of thinking tokens per interaction. Treat it as a planning lever, not a detail.

Agentic workflows are the surprise factor. A single user prompt like “book a ticket to Miami” can fan out into many model calls once an agent handles it the way a hired employee would. The primary agent reasons, calls a sub-agent, waits, calls another tool, reasons again. Token demand for one user action can be many times what a chatbot would consume for the same prompt.
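One way to see the fan-out is to list the model calls behind a single user action and add them up. The sketch below is illustrative only: the `ModelCall` type, the step names, and every token count are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ModelCall:
    step: str
    input_tokens: int
    output_tokens: int


def tokens_for_action(calls: list[ModelCall]) -> int:
    """Total tokens consumed by every model call behind one user action."""
    return sum(call.input_tokens + call.output_tokens for call in calls)


# A plain chatbot answers the prompt in one call (hypothetical counts).
chatbot = [ModelCall("answer", input_tokens=200, output_tokens=300)]

# An agent handles the same prompt in several calls, and the context grows at each step.
agent = [
    ModelCall("primary agent plans", 200, 400),
    ModelCall("sub-agent searches flights", 600, 500),
    ModelCall("tool result is read", 1_200, 300),
    ModelCall("primary agent reasons again", 1_500, 400),
    ModelCall("final answer", 1_800, 300),
]

fan_out = tokens_for_action(agent) / tokens_for_action(chatbot)
print(f"chatbot: {tokens_for_action(chatbot):,} tokens")
print(f"agent:   {tokens_for_action(agent):,} tokens")
print(f"fan-out multiplier: {fan_out:.1f}x")
```

In this invented trace, five calls consume 7,200 tokens against 500 for the single chatbot reply. A real multiplier has to be measured from the workload's own traces.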

Short-term memory of recent context lives in the KV cache. A high cache hit rate cuts the work the model has to redo. A low hit rate makes every request feel like the first one. The budget moves with the rate.
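A rough sketch of how the hit rate moves the budget follows. It assumes, as a simplification, that a cache hit removes the cost of the reused input tokens entirely. In practice a serving stack or provider may only discount them, so treat the function and its numbers as hypothetical.

```python
def effective_input_tokens(input_tokens: float, cache_hit_rate: float) -> float:
    """Input tokens the model still has to process after cache reuse.

    Simplifying assumption: cached tokens cost nothing to reuse.
    """
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    return input_tokens * (1.0 - cache_hit_rate)


# Hypothetical workload: one million input tokens per hour at three hit rates.
for rate in (0.0, 0.5, 0.75):
    remaining = effective_input_tokens(1_000_000, rate)
    print(f"hit rate {rate:.0%}: {remaining:,.0f} tokens to process")
```

The same million tokens of context becomes a quarter of the work at a 75 percent hit rate, which is why the rate belongs in the budget as a measured input rather than a guess.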

Time of day matters because users arrive in waves. The peak demand at one in the afternoon on a Tuesday in a U.S. office can be five times the floor. The capacity has to fit the peak, not the average, or the user waits.
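Sizing for the peak rather than the average can be sketched with a made-up hourly profile. The demand curve, the per-replica throughput, and the `replicas_needed` helper are all hypothetical. The only detail taken from the essay is a peak at one in the afternoon that is five times the low.

```python
import math
from statistics import mean


def replicas_needed(demand_tps: float, replica_tps: float) -> int:
    """Whole serving replicas required to cover a demand in tokens per second."""
    if replica_tps <= 0:
        raise ValueError("replica_tps must be positive")
    return math.ceil(demand_tps / replica_tps)


# Hypothetical weekday profile: tokens per second for each hour, midnight to 11 pm.
hourly_tps = (
    [2_000] * 8
    + [6_000, 8_000, 9_000, 9_500, 9_800, 10_000, 9_500, 9_000, 8_000, 6_000]
    + [3_000] * 6
)

REPLICA_TPS = 1_500  # assumed throughput of one serving replica

avg_replicas = replicas_needed(mean(hourly_tps), REPLICA_TPS)
peak_replicas = replicas_needed(max(hourly_tps), REPLICA_TPS)

print(f"sized for the average: {avg_replicas} replicas")
print(f"sized for the peak:    {peak_replicas} replicas")
```

Here the average calls for 4 replicas and the peak calls for 7. A plan built on the average leaves the one o'clock users waiting.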

Every factor above follows the same pattern. The base math is simple. The factors that change the bill are not obvious. The team that does the math once is not surprised at the end of the month. The team that does not is.

Run the demand math on a kitchen table on a Sunday morning and you have a number to take into the next infrastructure conversation. The reader who skips the math has a guess.

Cost per GPU hour is the wrong metric

Most AI infrastructure decisions get made on input metrics. Cost per GPU hour. Floating-point operations per dollar. Both numbers are easy to find. Both look precise. Both miss the point.

Again, the business does not sell GPU hours. The business sells tokens.

One metric combines what you are paying with what you are producing, and that is cost per token. It is the only metric that lets a buyer compare two infrastructure quotes apples to apples.

A team budgeting on cost per GPU hour can buy what looks like the right capacity and still run short of tokens. The GPU hour says nothing about how productive it will be. A new generation of hardware produces far more tokens per hour than the last. The hour costs more. The token costs less. The team watching only the hour cost missed the win.
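The conversion from an hourly price to a per-token price is short enough to sketch. The two quotes below are invented for illustration: the prices and throughput figures are assumptions, not measurements of any real hardware.

```python
def cost_per_million_tokens(gpu_hour_price: float, tokens_per_second: float) -> float:
    """Convert an hourly GPU price and a measured throughput into cost per million tokens."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3_600
    return gpu_hour_price / tokens_per_hour * 1_000_000


# Hypothetical quotes: the newer hardware costs twice as much per hour
# but produces five times the tokens on the same workload.
older = cost_per_million_tokens(gpu_hour_price=2.00, tokens_per_second=500)
newer = cost_per_million_tokens(gpu_hour_price=4.00, tokens_per_second=2_500)

print(f"older generation: ${older:.2f} per million tokens")
print(f"newer generation: ${newer:.2f} per million tokens")
```

With these numbers the hour doubles in price while the token falls from about $1.11 to about $0.44 per million. The throughput has to be measured on the buyer's own workload for the comparison to mean anything.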

Cost per GPU hour measures the size of the engine. Cost per token measures what came off the line. The engine size is interesting. The output is the business.

One sentence to take into any vendor conversation carries the whole point: “What is your cost per token on my workload?” Not your cost per hour. Not your operations per dollar. Your cost per token on the workload the buyer’s business actually runs.

A buyer who walks into a vendor meeting with a measured demand number for their own workload gets capacity sized to the work. A buyer who walks in with only the vendor’s headline figures gets whatever the pitch was built to show.

Any vendor who cannot answer the cost-per-token question is selling on input metrics. The vendor who can answer it is selling on output. The buyer who learns to ask the question changes what gets sold.

Price the token, then watch the bill grow anyway

The cost per token is falling. The bill is growing.

Both numbers are true at the same time. They are not in conflict. They are how the math of AI infrastructure works.

Figure: the chain from cost per token through workload mix and demand growth to the customer's monthly bill. Falling cost per token is the headline. Rising bill is the result when demand outpaces the price cut.

Unit price is falling because the hardware is getting better and the model providers are competing for the buyer. Cost per token is the line item that gives the chief executive a good story for the board.

Total bill is growing because demand is outpacing the price cut. The business is using more tokens per user. More users are signing up. Each user is running agentic workflows that consume many tokens per action. Reasoning models are spending more tokens per question. The price of each token is shrinking, but the number of tokens bought is climbing faster.
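The bill is the unit price multiplied by the volume, so a short projection shows both facts at once. The growth and decline rates below are hypothetical, chosen only so that volume grows faster than price falls.

```python
def monthly_bill(tokens: float, price_per_million: float) -> float:
    """Total bill: tokens consumed times the unit price."""
    return tokens / 1_000_000 * price_per_million


def project(tokens: float, price: float, token_growth: float, price_decline: float, years: int):
    """Yearly (year, price per million tokens, monthly bill) under constant rates."""
    rows = []
    for year in range(years + 1):
        rows.append((year, price, monthly_bill(tokens, price)))
        tokens *= 1 + token_growth   # demand grows
        price *= 1 - price_decline   # unit price falls
    return rows


# Hypothetical: token volume triples each year while the unit price halves.
rows = project(
    tokens=1_000_000_000, price=2.00, token_growth=2.0, price_decline=0.5, years=3
)

for year, price, bill in rows:
    print(f"year {year}: ${price:.2f} per million tokens, bill ${bill:,.0f} per month")
```

Under these assumptions the unit price drops from $2.00 to $0.25 per million tokens while the monthly bill climbs from $2,000 to $6,750. Both lines are correct, and each would mislead if read alone.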

Both facts hold at once because they track different things. One measures what each token costs. The other measures how many tokens the business consumes. A buyer watching only one of them is surprised by the other.

So budget both numbers. Track the cost per token to evaluate the vendor. Track the total bill to evaluate the business. Treat the two as separate lines that move on separate clocks.

A chief financial officer who reads only the unit price misses the bill. One who reads only the bill misses the win the unit price represents. A household that watches only the price of electricity and never the size of the check is surprised the same way. The reader who reads both is reading the AI economy with the right map.

Watch which year the bill grows. It grows in the year the business succeeds at using AI, and it stays flat in the year the business is not using AI. A flat AI bill is not a sign of cost discipline. It is a sign of underuse.

The token is the product. The bill is the business. The two numbers move on different clocks. The buyer who can read the unit price as the vendor’s win and the total bill as the company’s use has a framework the chief financial officer can defend.

Growth years should show a growing bill, and that is the right bill. The bill that stays flat in the year of growth is the wrong one.

Source

The argument draws on Shrudi Kopakar’s interview with Noah Kravitz on a 2025 AI Podcast about accelerated computing.

Questions readers ask

Six questions on this essay.

01 What is tokenomics?

Tokenomics is the budget framework that treats the token as the product the business sells. Most companies buying AI infrastructure budget for the input: cost per graphics processing unit hour or floating-point operations per dollar. The business does not sell graphics processing unit hours. The business sells tokens, or things made of tokens like chat replies, summarized documents, lines of code, or booked itineraries. Tokenomics flips the budget from the input unit to the output unit. The first move is to recognize that token value has two dimensions, intelligence and interactivity. The second is to do the demand math from users, sessions, and tokens per session, multiplied by workload-specific factors. The third is to measure cost per token. The fourth is to track the total bill separately from the unit price, because the two move at different speeds.

02 Are all tokens worth the same?

No. Token value has two dimensions. Intelligence is the first one, and it determines how hard a question the model can answer. A frontier model produces tokens worth more on a hard task than a small model produces on the same task. Interactivity is the second dimension, and it determines how fast the token arrives. A token that arrives in two seconds is worth more than the same token arriving in two minutes, because the user is waiting on it. The most expensive token is not always the right one for the job. A small fine-tuned model can beat a giant one on a narrow task at a fraction of the cost. The value of a token is relative to the work the user actually needs done. Reading the two axes together is how a buyer matches the right token to the right workload.

03 How do you estimate AI token demand for a workload?

Start with the base formula. Users multiplied by sessions multiplied by tokens per session. That is the floor. Then multiply by the workload-specific factors. Reasoning models add invisible thinking tokens the user never sees. Set a threshold and treat it as a planning lever. Agentic workflows fan out into many model calls per user prompt. A single prompt like “book a ticket to Miami” can consume many times what a chatbot would for the same prompt. The key-value cache hit rate reduces effective demand by reusing recent context. The hit rate is a number worth measuring. Time-of-day patterns matter because users arrive in waves and capacity has to fit the peak, not the average. Seasonal patterns and user growth shape the curve over the year. The math is not hard. Sitting down to do it is.

04 Why are agentic workflows so expensive to run?

Because each user prompt fans out into many model calls under the hood. The primary agent reads the prompt, reasons about it, calls a sub-agent to do part of the work, waits for the sub-agent to finish, calls another tool to retrieve information, reasons again, and finally answers the user. Each of those steps is a model call. Each model call consumes tokens. A single user action can fire a dozen or more model calls inside the agent loop. Capacity planning that treats one user prompt as one model call will be short by an order of magnitude on an agentic workload. The fix is to plan the demand math with an agentic multiplier in the equation. The multiplier varies by workload but is rarely small. Reasoning models add a second multiplier on top of the agentic one because each step has its own thinking tokens.

05 What is the key-value cache and why does it matter for budget?

The key-value cache is the model's short-term memory of recent context. When a user sends a follow-up message in the same conversation, the model can reuse the cached representation of the earlier turns instead of recomputing them from scratch. A high cache hit rate cuts the work the model has to do, which cuts the token bill. A low hit rate makes every request feel like the first one, which raises the bill. The hit rate depends on how the workload is structured. Long conversations with one user have high hit rates. Short bursts from many users have low ones. The budget conversation should include the cache hit rate as a planning input because the same workload can cost very different amounts depending on how the cache is being used by the serving stack.

06 Why does my AI bill keep growing even when the cost per token falls?

Because demand is growing faster than the price is falling. Both numbers are true at the same time. The unit price of a token falls because hardware improves and model providers compete for the buyer. The total bill grows because the business uses more tokens per user, more users sign up, agentic workflows consume more tokens per action, and reasoning models spend more tokens per question. The price per token shrinks. The number of tokens consumed climbs faster. The fix is to budget both numbers separately. Track cost per token to evaluate the vendor. Track total bill to evaluate the business. A flat AI bill is not a sign of cost discipline. A flat AI bill is a sign the business is not using AI to drive growth, which is its own problem on a different line of the budget.

About the author

Hanh D. Brown, writer.

Hanh D. Brown writes on AI, aging, and the decisions in between. Twenty years building systems for life-stage choices, now writing the publication with time to ask why.
