An essay on AI
AI Cost Per Token: Why GPU Hour Is the Wrong Bill
AI bills come in confusing units. The fix is to budget for the product the business sells: tokens. Cost per token is the only metric that compares two quotes.

If you have ever looked at an Artificial Intelligence (AI) bill and wondered whether the number was reasonable, the problem is usually not the price. The problem is the metric. Most AI infrastructure gets bought the way utilities get bought, by the unit of supply. The business sells a different unit.

Short answer

Why is AI cost per token the right metric and the GPU hour the wrong one?

AI cost per token is the right metric. The graphics processing unit hour is the wrong bill. A token is the product the business sells. The demand math is users times sessions times tokens per session, multiplied by reasoning overhead and agentic fan-out. Cost per token is the only number that compares two quotes.

A token is a product, not a unit of compute

The default failure mode is treating AI infrastructure as a commodity buy.

Like electricity. Like bandwidth. Like a kilowatt-hour or a megabit per second. Tokens are not electricity. Tokens are products with different values for different jobs.

The business does not sell graphics processing unit (GPU) hours. The business sells tokens, or things made of tokens. A chat reply on a phone. A summarized document on a desk. A line of code in a file. A booked travel itinerary in an email. Each of those is a product the customer can name. Each of those is paid for in tokens behind the scenes.

Token value has two dimensions, not one.

Figure: Token value plotted on two axes, intelligence and interactivity (speed of arrival), with examples of where different models and use cases land on the grid. The value of a token is its position on the grid, not the line item on the invoice.

The first dimension is intelligence. The token from a frontier model is worth more than the token from a small one because the frontier model can answer harder questions. Most readers think about intelligence first because intelligence is what the press releases talk about.

The second dimension is interactivity. The token that arrives in two seconds is worth more than the token that arrives in two minutes, even at equal intelligence, because the user is waiting on it. Interactivity is the dimension most people forget.

The two together set what a token is worth, and what price the token can command.

The most expensive token is not always the right one for the job. A small fine-tuned model can beat a giant one on a narrow task, and at a fraction of the cost. The value of a token is relative to the work the user actually needs done.

Treating tokens as a single commodity is like a wine buyer pricing every bottle by the milliliter from a desk in the back room. The math is precise. The pricing is wrong. Priced that way, the bottle the buyer actually wanted is the bottle the buyer overpays for.

The reader who can name where their workload sits on the two-axis grid is the reader whose AI budget is built on the right unit. The rest of the post is what to do with that unit.

The demand math is just math

Now that you know what a token is, the next question is how many of them you need.

This is where most AI capacity decisions go from “guess” to “calculation.”

The base math fits on a napkin. Users multiplied by sessions multiplied by tokens per session. That is the floor.

The number you actually need is the floor multiplied by a series of factors the source names directly. Reasoning models add invisible thinking tokens. Agentic workflows fan out into many model calls per user prompt. The key-value (KV) cache hit rate reduces effective demand. Time of day and user growth shape the curve.

The math is not the hard part. Sitting down to do it is.
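The napkin math above can be sketched in a few lines. This is a minimal illustration, not a sizing tool: the function name, the parameter names, and every number in the example are hypothetical, and treating the KV cache hit rate as a flat discount on total demand is a simplifying assumption.

```python
def estimate_daily_tokens(
    users: int,
    sessions_per_user: float,
    tokens_per_session: float,
    reasoning_multiplier: float = 1.0,  # assumed: extra hidden thinking tokens
    agentic_fanout: float = 1.0,        # assumed: model calls per user prompt
    cache_hit_rate: float = 0.0,        # assumed: share of work the cache saves
) -> dict:
    """Return the floor, the gross demand, and the cache-adjusted demand."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")

    # The floor: users x sessions x tokens per session.
    floor = users * sessions_per_user * tokens_per_session

    # Reasoning overhead and agentic fan-out push demand above the floor.
    gross = floor * reasoning_multiplier * agentic_fanout

    # A higher cache hit rate lowers the effective demand.
    effective = gross * (1.0 - cache_hit_rate)

    return {"floor": floor, "gross": gross, "effective": effective}


# Hypothetical workload: 10,000 users, 3 sessions each, 2,000 tokens per session.
demand = estimate_daily_tokens(
    users=10_000,
    sessions_per_user=3,
    tokens_per_session=2_000,
    reasoning_multiplier=1.5,
    agentic_fanout=4.0,
    cache_hit_rate=0.25,
)
print(demand)
```

With these made-up inputs the floor is 60 million tokens a day, and the adjusted number is several times that. The gap between the two is the part a guess tends to miss.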

Reasoning models consume tokens the user never sees. The model works through the problem in a kind of private monologue, and every line of that monologue is a token the business pays for. Set a threshold on the number of thinking tokens per interaction. Treat it as a planning lever, not a detail.

Agentic workflows are the surprise factor. A single user prompt like “book a ticket to Miami” can fan out into many model calls under the hood. The primary agent reasons, calls a sub-agent, waits, calls another tool, reasons again. Token demand for one user action can be many times what a chatbot would consume for the same prompt.
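One way to see the fan-out is to count tokens per user action rather than per model call. The sketch below is an assumption-laden illustration: the call counts, the token counts, and the `thinking_cap` parameter are hypothetical stand-ins for the thinking-token threshold described above.

```python
from typing import Optional


def tokens_per_action(
    model_calls: int,
    visible_tokens_per_call: int,
    thinking_tokens_per_call: int = 0,
    thinking_cap: Optional[int] = None,
) -> int:
    """Tokens consumed by one user action across every model call it triggers."""
    thinking = thinking_tokens_per_call
    if thinking_cap is not None:
        # The planning lever: limit hidden thinking tokens per call.
        thinking = min(thinking, thinking_cap)
    return model_calls * (visible_tokens_per_call + thinking)


# Hypothetical numbers for the same prompt handled two ways.
chatbot = tokens_per_action(model_calls=1, visible_tokens_per_call=500)
agent = tokens_per_action(
    model_calls=12, visible_tokens_per_call=500, thinking_tokens_per_call=1_500
)
capped_agent = tokens_per_action(
    model_calls=12,
    visible_tokens_per_call=500,
    thinking_tokens_per_call=1_500,
    thinking_cap=1_000,
)

print(chatbot, agent, capped_agent)
print("fan-out multiplier:", agent / chatbot)
```

The ratio of the agent figure to the chatbot figure is the multiplier that belongs in the demand math. The cap shows how much of it a thinking-token threshold can claw back.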

The KV cache is the model’s short-term memory of recent context. A high cache hit rate cuts the work the model has to redo. A low hit rate makes every request feel like the first one. The budget moves with the rate.

Time of day matters because users arrive in waves. The peak demand at one in the afternoon on a Tuesday in a U.S. office can be five times the floor. The capacity has to fit the peak, not the average, or the user waits.
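Sizing for the peak rather than the average is a one-line calculation. The sketch below assumes the peak is expressed as a multiple of the average rate, which is a simplification. The daily volume is hypothetical, and the factor of five simply reuses the figure mentioned above.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def required_tokens_per_second(daily_tokens: float, peak_factor: float = 1.0) -> float:
    """Throughput needed to serve the busiest moment, not the average one."""
    average_rate = daily_tokens / SECONDS_PER_DAY
    return average_rate * peak_factor


# Hypothetical daily volume of 864 million tokens.
average = required_tokens_per_second(864_000_000)
peak = required_tokens_per_second(864_000_000, peak_factor=5.0)

print(f"average: {average:,.0f} tokens/s, peak: {peak:,.0f} tokens/s")
```

Capacity bought against the average figure leaves the Tuesday-afternoon user waiting. Capacity bought against the peak figure sits partly idle overnight, which is a cost worth naming in the budget.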

Demand for tokens is like a household’s electricity bill. The base math is simple. The factors that change the bill are not obvious. The household that does the math once a year is not surprised at the end of the month. The household that does not is.

The reader who runs the demand math on a kitchen table on a Sunday morning has a number to take into the next infrastructure conversation. The reader who skips the math has a guess.

Cost per GPU hour is the wrong metric

This is the section that turns a reader from someone who can read an AI quote into someone who can evaluate one.

Most AI infrastructure decisions get made on input metrics. Cost per GPU hour. Floating-point operations per dollar. Both numbers are easy to find. Both look precise. Both miss the point.

The business does not sell GPU hours. The business sells tokens.

The metric that combines what you are paying with what you are producing is cost per token. It is the only metric that lets a buyer compare two infrastructure quotes apples to apples.

A team budgeting on cost per GPU hour can buy what looks like the right capacity and still end up short of tokens. The GPU hour says nothing about how productive that hour will be. A new generation of hardware produces many more tokens per hour than the previous generation. The hour costs more. The token costs less. The team that only watched the hour cost missed the win.
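The hardware-generation point can be made concrete by converting an hourly price into a price per million tokens. This is a rough sketch under stated assumptions: the hourly prices and throughput figures are invented for illustration, not vendor numbers, and real throughput depends on the model, batch size, and workload.

```python
def cost_per_million_tokens(gpu_hour_price: float, tokens_per_second: float) -> float:
    """Convert an input metric (price per hour) into an output metric (price per token)."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hour_price / tokens_per_hour * 1_000_000


# Hypothetical quotes: the newer hardware costs twice as much per hour
# but produces four times the tokens in that hour.
old_gen = cost_per_million_tokens(gpu_hour_price=2.00, tokens_per_second=1_000)
new_gen = cost_per_million_tokens(gpu_hour_price=4.00, tokens_per_second=4_000)

print(f"old generation: ${old_gen:.2f} per million tokens")
print(f"new generation: ${new_gen:.2f} per million tokens")
```

In this made-up comparison the pricier hour yields the cheaper token. The throughput figure has to be measured on the buyer's own workload for the comparison to mean anything.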

Cost per GPU hour measures the size of the engine. Cost per token measures what came off the line. The engine size is interesting. The output is the business.

The one sentence to take into any vendor conversation is this: “What is your cost per token on my workload?” Not cost per hour. Not operations per dollar. Cost per token on the workload your business actually runs.

A buyer who walks into a vendor meeting like a buyer at a custom shop with a measured cut list goes home with the right material. A buyer who walks in like a tourist comparing brochures goes home with whatever the brochure pictured.

The vendor who cannot answer the cost-per-token question is selling on input metrics. The vendor who can answer it is selling on output. The buyer who learns to ask the question changes what gets sold.

Price the token, then watch the bill grow anyway

Here is the part that surprises most chief financial officers.

The cost per token is falling. The bill is growing.

Both numbers are true at the same time. They are not in conflict. They are how the math of AI infrastructure works.

Figure: The chain from cost per token through workload mix and demand growth to the customer's monthly bill. Falling cost per token is the headline. Rising bill is the result when demand outpaces the price cut.

The unit price of a token is falling because the hardware is getting better and the model providers are competing for the buyer. Cost per token is the line item that gives the chief executive a good story for the board.

The total bill is growing because demand is outpacing the price cut. The business is using more tokens per user. More users are signing up. Each user is running agentic workflows that consume many tokens per action. Reasoning models are spending more tokens per question. The price of each token is shrinking, but the number of tokens consumed is climbing faster.

This is the same shape as a household water bill in a drought year. The unit price of water falls because the city is subsidizing conservation. The total bill rises because the household is watering the garden more aggressively to keep the lawn alive. Both numbers move. The household reads the bill and is surprised.

The fix is to budget both numbers. Track the cost per token to evaluate the vendor. Track the total bill to evaluate the business. Treat the two as separate lines that move on separate clocks.
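Tracking the two lines separately takes very little arithmetic. In the sketch below, the token volumes and prices are hypothetical, chosen only to show a unit price that halves while the bill doubles.

```python
def monthly_bill(tokens: float, price_per_million: float) -> float:
    """Total bill = tokens consumed x unit price."""
    return tokens / 1_000_000 * price_per_million


# Hypothetical year-over-year change: the price halves, demand quadruples.
last_year = {"tokens": 1_000_000_000, "price_per_million": 2.00}
this_year = {"tokens": 4_000_000_000, "price_per_million": 1.00}

bill_before = monthly_bill(**last_year)
bill_after = monthly_bill(**this_year)

# Line one evaluates the vendor; line two evaluates the business.
price_change = this_year["price_per_million"] / last_year["price_per_million"] - 1
bill_change = bill_after / bill_before - 1

print(f"unit price change: {price_change:+.0%}")
print(f"total bill change: {bill_change:+.0%}")
```

Both lines belong in the same report. Read alone, the first suggests savings and the second suggests overspend, and neither reading is right.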

The chief financial officer who reads only the unit price misses the bill. The chief financial officer who reads only the bill misses the win the unit price represents. The reader who reads both is reading the AI economy with the right map.

The bill grows in the year the business succeeds at using AI. The bill stays flat in the year the business is not using AI. A flat AI bill is not a sign of cost discipline. It is a sign of underuse.

The token is the product. The bill is the business. Both numbers move on different clocks. The buyer who can read the unit price as the vendor’s win and the total bill as the company’s use has a framework the chief financial officer can defend.

The bill that grows in the year of growth is the right bill. The bill that stays flat in the year of growth is the wrong one.

Source

The argument draws on Shrudi Kopakar’s interview with Noah Kravitz on a 2025 AI Podcast about accelerated computing.

Questions readers ask

Six questions on this essay.

01 What is tokenomics?

Tokenomics is the budget framework that treats the token as the product the business sells. Most companies buying AI infrastructure budget for the input: cost per graphics processing unit hour or floating-point operations per dollar. The business does not sell graphics processing unit hours. The business sells tokens, or things made of tokens like chat replies, summarized documents, lines of code, or booked itineraries. Tokenomics flips the budget from the input unit to the output unit. The first move is to recognize that token value has two dimensions, intelligence and interactivity. The second is to do the demand math from users, sessions, and tokens per session, multiplied by workload-specific factors. The third is to measure cost per token. The fourth is to track the total bill separately from the unit price, because the two move at different speeds.

02 Are all tokens worth the same?

No. Token value has two dimensions. Intelligence is the first one, and it determines how hard a question the model can answer. A frontier model produces tokens worth more on a hard task than a small model produces on the same task. Interactivity is the second dimension, and it determines how fast the token arrives. A token that arrives in two seconds is worth more than the same token arriving in two minutes, because the user is waiting on it. The most expensive token is not always the right one for the job. A small fine-tuned model can beat a giant one on a narrow task at a fraction of the cost. The value of a token is relative to the work the user actually needs done. Reading the two axes together is how a buyer matches the right token to the right workload.

03 How do you estimate AI token demand for a workload?

Start with the base formula. Users multiplied by sessions multiplied by tokens per session. That is the floor. Then multiply by the workload-specific factors. Reasoning models add invisible thinking tokens the user never sees. Set a threshold and treat it as a planning lever. Agentic workflows fan out into many model calls per user prompt. A single prompt like “book a ticket to Miami” can consume many times what a chatbot would for the same prompt. The key-value cache hit rate reduces effective demand by reusing recent context. The hit rate is a number worth measuring. Time-of-day patterns matter because users arrive in waves and capacity has to fit the peak, not the average. Seasonal patterns and user growth shape the curve over the year. The math is not hard. Sitting down to do it is.

04 Why are agentic workflows so expensive to run?

Because each user prompt fans out into many model calls under the hood. The primary agent reads the prompt, reasons about it, calls a sub-agent to do part of the work, waits for the sub-agent to finish, calls another tool to retrieve information, reasons again, and finally answers the user. Each of those steps is a model call. Each model call consumes tokens. A single user action can fire a dozen or more model calls inside the agent loop. Capacity planning that treats one user prompt as one model call will be short by an order of magnitude on an agentic workload. The fix is to plan the demand math with an agentic multiplier in the equation. The multiplier varies by workload but is rarely small. Reasoning models add a second multiplier on top of the agentic one because each step has its own thinking tokens.

05 What is the key-value cache and why does it matter for budget?

The key-value cache is the model's short-term memory of recent context. When a user sends a follow-up message in the same conversation, the model can reuse the cached representation of the earlier turns instead of recomputing them from scratch. A high cache hit rate cuts the work the model has to do, which cuts the token bill. A low hit rate makes every request feel like the first one, which raises the bill. The hit rate depends on how the workload is structured. Long conversations with one user have high hit rates. Short bursts from many users have low ones. The budget conversation should include the cache hit rate as a planning input because the same workload can cost very different amounts depending on how the cache is being used by the serving stack.

06 Why does my AI bill keep growing even when the cost per token falls?

Because demand is growing faster than the price is falling. Both numbers are true at the same time. The unit price of a token falls because hardware improves and model providers compete for the buyer. The total bill grows because the business uses more tokens per user, more users sign up, agentic workflows consume more tokens per action, and reasoning models spend more tokens per question. The price of each token shrinks. The number of tokens consumed climbs faster. The fix is to budget both numbers separately. Track cost per token to evaluate the vendor. Track total bill to evaluate the business. A flat AI bill is not a sign of cost discipline. A flat AI bill is a sign the business is not using AI to drive growth, which is its own problem on a different line of the budget.

About the author
Hanh D. Brown, writer.

Essayist writing on craft, voice, aging, and what gets harder to say with the years. Twenty years building AI systems for life-stage decisions. Now writing the publication that has the time to ask why.