Your AI vendor's pricing will change. Plan for it.
OpenAI has changed pricing four times in 18 months. Anthropic twice. Google three times. If your unit economics depend on current API pricing, they are fiction.
Someone on your team built a spreadsheet. It says your AI feature costs $0.04 per query. It multiplies that by projected volume. It shows a healthy margin. Everyone feels good.
That spreadsheet is fiction. Not because the math is wrong. Because the price it is based on will change — and you do not know when, by how much, or in which direction.
The price history
OpenAI launched GPT-4 at $0.03 per 1K input tokens. Then GPT-4 Turbo dropped it to $0.01. Then GPT-4o dropped it further. Then they introduced cached input pricing at half the rate. Then reasoning models arrived with a different pricing structure entirely — thinking tokens that you pay for but never see.
Anthropic launched Claude 3 Opus at one price, then introduced Claude 3.5 Sonnet at a fraction of the cost with better performance. Then prompt caching changed the math again.
Google has repriced Gemini models multiple times, introduced context caching, and restructured their API tiers.
In the last 18 months, no major AI API provider has kept their pricing stable for more than six months. Prices have generally gone down — which is good — but the pricing model has changed in ways that make forecasting difficult.
Per-token vs. per-request. Input vs. output pricing. Cached vs. uncached. Thinking tokens vs. completion tokens. Batch vs. real-time. Each of these is a different axis that can shift your unit economics.
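The axes above can be made concrete with a small cost model. This is a sketch with illustrative placeholder rates — not any provider's real prices — showing how each axis contributes to per-query cost, and how thinking tokens in particular can swamp the rest.

```python
# Per-query cost model covering the pricing axes above.
# All rates are illustrative placeholders, not real provider prices.

def query_cost(
    input_tokens: int,
    output_tokens: int,
    cached_input_tokens: int = 0,
    thinking_tokens: int = 0,
    *,
    input_rate: float = 0.01,     # $ per 1K uncached input tokens
    cached_rate: float = 0.005,   # $ per 1K cached input tokens
    output_rate: float = 0.03,    # $ per 1K output tokens
    thinking_rate: float = 0.03,  # thinking tokens typically bill at the output rate
) -> float:
    """Cost of one query in dollars. Every rate can change independently."""
    return (
        input_tokens / 1000 * input_rate
        + cached_input_tokens / 1000 * cached_rate
        + output_tokens / 1000 * output_rate
        + thinking_tokens / 1000 * thinking_rate
    )

# The same query, with and without a reasoning model's hidden thinking tokens.
plain = query_cost(2000, 500)                            # $0.035
reasoning = query_cost(2000, 500, thinking_tokens=2500)  # $0.110
```

Note that the 2,500 thinking tokens — which never appear in the response — more than triple the cost of this query. That is the variance problem in miniature.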
The danger of coupling
When your business model is tightly coupled to current API pricing, you have a fragile system. Here is how it breaks:
Scenario 1: Prices go up. A provider introduces a new model that is better for your use case, but more expensive. You want to upgrade. Your margin disappears. You now have to choose between product quality and profitability — a choice you should never have to make on short notice.
Scenario 2: Pricing model changes. Your cost model assumes per-token pricing. The provider introduces per-request pricing with a token cap. Your short queries get more expensive. Your long queries get cheaper. Your aggregate cost shifts unpredictably.
Scenario 3: You need to switch providers. Your primary provider has an outage. Or they deprecate your model. Or a competitor releases something significantly better. If switching requires re-deriving your unit economics from scratch, you will move slowly — and in AI, moving slowly is expensive.
Scenario 4: Thinking tokens. You are using a reasoning model. Your cost model counts input tokens and output tokens. But the model is generating 5x more thinking tokens than output tokens — tokens you pay for, that do not appear in the response, and that vary wildly based on query complexity. Your cost-per-query variance goes from 10% to 300%.
The 2x buffer
The simplest defense: build your cost model with a 2x buffer on API costs. If your current cost is $0.04 per query, model your economics as if it were $0.08. If the business still works at $0.08, you have room to absorb pricing changes without panic.
This sounds conservative. It is. That is the point.
The 2x buffer is not waste — it is insurance against being forced into a bad decision when prices shift. And if prices continue to drop, the buffer becomes margin. You do not lose by being conservative here.
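In practice the buffer check is a one-line stress test on your margin. The price and cost figures here are hypothetical; substitute your own.

```python
# Stress-test unit economics with a 2x buffer on API cost.
# Price and cost figures are hypothetical.

def margin_per_query(price: float, api_cost: float, other_cost: float = 0.0) -> float:
    """Gross margin on a single query."""
    return price - api_cost - other_cost

price = 0.25     # revenue (or imputed value) per query
api_cost = 0.04  # today's measured API cost per query

current = margin_per_query(price, api_cost)       # margin at today's price
buffered = margin_per_query(price, api_cost * 2)  # margin after a 2x repricing

# If the buffered margin is still positive, a repricing event is a
# business decision, not an emergency.
assert buffered > 0
```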
Abstract the model layer
Your application code should not know which model it is calling. It should call an internal abstraction — a model service, a gateway, a routing layer — that handles model selection, fallback, and cost tracking.
This is not over-engineering. It is the same pattern we use for every external dependency. You do not hardcode database connection strings. You do not hardcode payment processor endpoints. Do not hardcode model endpoints.
The abstraction should handle:
- Model selection. Route queries to different models based on complexity, cost, or latency requirements.
- Fallback. If the primary model is down or slow, fall back to an alternative.
- Cost tagging. Tag every API call with the feature, user segment, or query type that triggered it. This data is essential for understanding where your money goes.
- Rate limiting. Enforce per-feature or per-user rate limits to prevent cost spikes.
Building this abstraction takes a week. It pays for itself the first time you need to switch models — which, given the pace of this market, will be within a quarter.
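A minimal version of this gateway fits in a few dozen lines. The sketch below assumes a generic `call_provider()` client (stubbed here so the example is self-contained); the model names and rates are hypothetical. It shows three of the four responsibilities — selection, fallback, and cost tagging — in one place.

```python
# Minimal model gateway: selection, fallback, and cost tagging.
# call_provider(), model names, and rates are all hypothetical stand-ins.
import time
from dataclasses import dataclass, field

def call_provider(model: str, prompt: str):
    """Stub for a real API client; returns (text, token usage)."""
    return f"[{model}] response", {"input": len(prompt.split()), "output": 20}

@dataclass
class ModelConfig:
    name: str
    cost_per_1k_input: float
    cost_per_1k_output: float

@dataclass
class ModelGateway:
    primary: ModelConfig
    fallback: ModelConfig
    cost_log: list = field(default_factory=list)

    def complete(self, prompt: str, *, feature: str) -> str:
        # Try the primary model first; fall back on timeout.
        for model in (self.primary, self.fallback):
            try:
                text, usage = call_provider(model.name, prompt)
            except TimeoutError:
                continue  # fallback: move on to the next model
            cost = (usage["input"] / 1000 * model.cost_per_1k_input
                    + usage["output"] / 1000 * model.cost_per_1k_output)
            # Cost tagging: attribute every call to the feature that made it.
            self.cost_log.append({"feature": feature, "model": model.name,
                                  "cost": cost, "ts": time.time()})
            return text
        raise RuntimeError("all models failed")
```

Application code calls `gateway.complete(prompt, feature="search")` and never names a model. Switching providers means editing the two `ModelConfig` entries — a config change, not a rewrite.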
Monitor cost per query
You track latency per endpoint. You track error rate per service. You should track cost per query for every AI feature.
Not aggregate monthly cost — cost per query, broken down by feature, model, and query type. This is how you spot problems early.
Useful metrics:
- p50, p90, p99 cost per query. The p50 tells you what a typical query costs. The p99 tells you about the expensive outliers — the queries that trigger long reasoning chains or retrieve large contexts.
- Cost per query by feature. One feature might account for 70% of your spend but 20% of your queries. That is worth knowing.
- Cost trend over time. A gradual increase in cost per query can indicate prompt drift, retrieval bloat, or model behavior changes that have nothing to do with pricing.
Set alerts on cost anomalies. A sudden spike in cost per query might mean a prompt change is causing longer outputs, a retrieval bug is stuffing more context into the prompt, or a model update changed the tokenization.
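If your gateway tags every call, producing these metrics is a small aggregation job. The sketch below assumes a log of records shaped like `{"feature": ..., "cost": ...}` — adjust the field names to whatever your gateway actually emits.

```python
# Per-feature cost-per-query percentiles from a log of tagged calls.
# The record shape ({"feature": ..., "cost": ...}) is an assumption.
from collections import defaultdict

def percentile(sorted_vals: list, p: float) -> float:
    """Nearest-rank percentile on a pre-sorted list."""
    if not sorted_vals:
        return 0.0
    k = min(len(sorted_vals) - 1, round(p / 100 * (len(sorted_vals) - 1)))
    return sorted_vals[k]

def cost_report(records: list) -> dict:
    """Group tagged calls by feature and compute p50/p90/p99 and total spend."""
    by_feature = defaultdict(list)
    for r in records:
        by_feature[r["feature"]].append(r["cost"])
    report = {}
    for feature, costs in by_feature.items():
        costs.sort()
        report[feature] = {
            "p50": percentile(costs, 50),
            "p90": percentile(costs, 90),
            "p99": percentile(costs, 99),
            "total": sum(costs),
        }
    return report
```

Run this over a rolling window and alert when p99 or the p99/p50 ratio jumps — that ratio is where thinking-token variance and retrieval bloat show up first.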
Plan for both directions
Most teams plan for prices going down. They assume API costs will shrink over time, and they are probably right in aggregate.
But plan for prices going up on specific capabilities. Better reasoning costs more. Longer context windows cost more. Multimodal capabilities cost more. The frontier model you will want in six months may be more expensive than the model you are using today, even if the model you are using today gets cheaper.
The right mental model is not “AI will get cheaper” but “the cost-performance frontier will shift.” You will have more options at every price point. But the option you want — the one that makes your feature significantly better — may be at a higher price point than you are at today.
The heuristic
Never build unit economics on current API pricing without a 2x buffer. Abstract your model layer so switching is a config change, not a rewrite. Monitor cost per query in production the same way you monitor latency. The price will change. The question is whether you are ready when it does.
tl;dr
The pattern. Teams build financial models on current API token prices even though every major AI provider has repriced, restructured billing, or introduced new cost axes — thinking tokens, cached input tiers, per-request caps — multiple times in the last 18 months.

The fix. Apply a 2x buffer to all AI cost assumptions, route every model call through an internal abstraction layer so switching providers is a config change, and monitor cost per query in production as a first-class metric.

The outcome. Pricing changes become a business decision you are prepared to make rather than an emergency that forces you to choose between product quality and profitability.