Caching LLM responses is not cheating
Semantic caching can cut your LLM costs by 40-60% and your latency by 90%. Most teams don't do it because it feels like they're "not really using AI." They are wrong.
There is a strange guilt that settles over teams when someone suggests caching LLM responses. It feels like cheating. Like the whole point was to have a model think about each query fresh. Like serving a cached response means you are not really using AI.
This is wrong. Caching is infrastructure. And infrastructure that makes your system faster, cheaper, and more predictable is not cheating — it is engineering.
The stigma
We have seen this pattern at multiple clients. Someone builds an AI feature. It works. It goes to production. The bill arrives. Someone on the team says, “We could cache the common queries.” And someone else — usually someone who championed the AI feature — pushes back. “If we’re just serving cached responses, why did we build an AI system?”
Because not every query needs fresh inference. Most queries don’t.
Look at your production logs. You will find that 30-50% of queries are semantically identical to queries you have already answered. Same question, different phrasing. Same intent, slightly different words. You are paying for a fresh API call each time, waiting 2-4 seconds each time, and getting roughly the same answer each time.
That is not engineering. That is waste.
Three tiers of caching
Not all caching is the same. There are three tiers, each with a different complexity-to-payoff ratio.
Exact match caching. Hash the prompt. If you have seen this exact prompt before, return the cached response. Implementation: a key-value store. Redis, DynamoDB, even an in-memory dictionary for low-traffic systems. Zero ambiguity, zero risk. If the prompt is identical, the response is valid.
This alone will catch 10-20% of queries in most production systems. Users copy-paste. Automated workflows send the same prompt repeatedly. Internal tools hit the same questions daily.
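A minimal sketch of the exact-match tier, assuming an in-memory dictionary as the store and a `call_model` callable standing in for whatever LLM client you use; in production you would swap the dict for Redis or DynamoDB.

```python
import hashlib


class ExactMatchCache:
    """Exact-match tier: a key-value store keyed by a hash of the full prompt."""

    def __init__(self):
        self._store = {}  # swap for Redis/DynamoDB beyond low-traffic systems

    def _key(self, prompt: str) -> str:
        # Hashing keeps keys fixed-size regardless of prompt length.
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get(self, prompt: str):
        return self._store.get(self._key(prompt))

    def put(self, prompt: str, response: str) -> None:
        self._store[self._key(prompt)] = response


def answer(prompt: str, cache: ExactMatchCache, call_model):
    """Return the cached response for an identical prompt; otherwise run inference."""
    cached = cache.get(prompt)
    if cached is not None:
        return cached
    response = call_model(prompt)
    cache.put(prompt, response)
    return response
```

One design note: because the key is the exact prompt, a hit is valid by construction. If your system prompt changes, either include it in the hashed key or flush the cache.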
Semantic caching. Embed the incoming query. Compare it to a store of previously seen query embeddings. If the cosine similarity exceeds a threshold — typically 0.95 or higher — return the cached response.
This is where it gets interesting. “What’s our refund policy?” and “How do I get a refund?” are different strings but the same question. Semantic caching catches these. Implementation is slightly more involved — you need an embedding model and a vector store — but if you already have a RAG pipeline, you already have both.
Semantic caching typically catches an additional 20-40% of queries on top of exact match caching. The key is tuning the similarity threshold. Too low and you serve wrong answers. Too high and you cache nothing. Start at 0.97 and lower it gradually while monitoring quality.
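A sketch of the semantic tier. The embedding function here is a deliberately toy bag-of-words vector so the example runs standalone; in practice you would use a real embedding model (where thresholds like 0.97 are meaningful) and a vector store instead of a linear scan.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    # Stand-in for a real embedding model (e.g. a sentence-transformer).
    # A bag-of-words vector is enough to demonstrate the lookup logic.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class SemanticCache:
    """Linear-scan semantic cache; swap in a vector store for production scale."""

    def __init__(self, threshold: float = 0.97):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def get(self, query: str):
        q = toy_embed(query)
        best, best_sim = None, 0.0
        for emb, response in self.entries:
            sim = cosine(q, emb)
            if sim > best_sim:
                best, best_sim = response, sim
        # Only serve the nearest neighbor if it clears the threshold.
        return best if best_sim >= self.threshold else None

    def put(self, query: str, response: str) -> None:
        self.entries.append((toy_embed(query), response))
```

With a real embedding model, paraphrases like the refund examples above land close enough to clear a high threshold; the toy vectors here only match on shared words.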
Tiered caching with freshness. Cache responses for common queries. Serve live inference for novel ones. Set a TTL on cached responses so they refresh when underlying data changes. Tag cache entries by data source so you can invalidate selectively when a source is updated.
This is the production-grade approach. It requires more engineering — cache invalidation is, as always, one of the two hard problems — but the payoff is significant.
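A sketch of TTL plus source-tag invalidation, assuming an in-memory store with lazy expiry on read; the names (`invalidate_source`, `sources`) are illustrative, not from any particular library.

```python
import time


class TieredCache:
    """Cache entries carry a TTL and a set of source tags for selective invalidation."""

    def __init__(self, default_ttl: float = 24 * 3600):
        self.default_ttl = default_ttl
        self._store = {}  # key -> (response, expires_at, sources)

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        response, expires_at, _ = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # lazy expiry: drop stale entries on read
            return None
        return response

    def put(self, key: str, response: str, sources=(), ttl=None) -> None:
        expires_at = time.monotonic() + (ttl if ttl is not None else self.default_ttl)
        self._store[key] = (response, expires_at, frozenset(sources))

    def invalidate_source(self, source: str) -> int:
        """Drop every entry derived from `source`; returns the count removed."""
        stale = [k for k, (_, _, srcs) in self._store.items() if source in srcs]
        for k in stale:
            del self._store[k]
        return len(stale)
```

When a source document is updated, one `invalidate_source` call evicts exactly the answers derived from it, while everything else stays warm.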
The ROI
The numbers are hard to argue with.
A client of ours was spending $45k/month on API calls for a customer-facing Q&A system. After implementing semantic caching with a 0.96 similarity threshold, their monthly API spend dropped to $18k. Latency for cached queries dropped from 2.8 seconds to 40 milliseconds.
That is not a rounding error. That is a 60% cost reduction and a 98% latency improvement for the majority of queries.
And there is a secondary benefit that teams rarely anticipate: consistency. When the same question gets the same answer every time, users trust the system more. Non-determinism is a feature when you need creativity. It is a bug when a user asks the same support question twice and gets contradictory answers.
When not to cache
Caching is not appropriate everywhere.
Do not cache when the answer depends on real-time data. Stock prices, live inventory, breaking news — these need fresh inference or at minimum very short TTLs.
Do not cache when the query includes user-specific context that changes the answer materially. “What’s my account balance?” is not cacheable across users, though it may be cacheable per-user with a short TTL.
Do not cache when you are still iterating on the prompt. Cached responses from an old prompt will persist until the cache is invalidated. If you change your system prompt, flush the cache.
And do not cache with a similarity threshold below 0.93. The false positive rate gets uncomfortable fast. One bad cached response erodes more trust than the caching saves in cost.
Implementation pattern
Here is the pattern we recommend:
- Start with exact match caching. Deploy it behind a feature flag. Monitor cache hit rate and output quality for two weeks.
- Add semantic caching once you are confident in the exact match layer. Start with a high similarity threshold (0.97) and lower it in increments of 0.01, monitoring quality at each step.
- Add TTL-based invalidation. Default to 24 hours. Shorten for data that changes frequently.
- Add source-based invalidation. When a source document is updated, invalidate all cache entries derived from it.
- Monitor cache hit rate, cost savings, latency distribution, and — critically — output quality. If quality degrades, raise the similarity threshold.
The whole thing can be built in a week. The first two steps can be done in a day.
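The steps above compose into a single lookup path. This sketch uses trivial stand-ins for both tiers (a plain dict for exact match, text normalization in place of embedding similarity) so it runs standalone; the flow, not the stubs, is the point.

```python
import re


class DictCache:
    """Stand-in for the exact-match tier (a hash-keyed KV store in production)."""
    def __init__(self):
        self._d = {}
    def get(self, q):
        return self._d.get(q)
    def put(self, q, r):
        self._d[q] = r


class NormalizedCache:
    """Stand-in for the semantic tier: normalizes text instead of embedding it."""
    def __init__(self):
        self._d = {}
    def _norm(self, q):
        return re.sub(r"[^a-z0-9 ]", "", q.lower())
    def get(self, q):
        return self._d.get(self._norm(q))
    def put(self, q, r):
        self._d[self._norm(q)] = r


def serve(query, exact_cache, semantic_cache, call_model):
    """Tiered lookup: exact match, then semantic match, then live inference."""
    response = exact_cache.get(query)
    if response is not None:
        return response
    response = semantic_cache.get(query)
    if response is not None:
        exact_cache.put(query, response)  # promote so repeats hit the cheap tier
        return response
    response = call_model(query)
    exact_cache.put(query, response)
    semantic_cache.put(query, response)
    return response
```

The ordering matters: the exact tier is cheapest and zero-risk, so it goes first; the semantic tier is probabilistic, so it is consulted only on an exact miss; the model is the fallback of last resort.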
The heuristic
If more than 20% of your production queries are semantically similar to previous queries, you should be caching. Check your logs. The number is almost always higher than you think.
Caching LLM responses is not cheating. It is the same engineering discipline we apply to every other expensive computation. The model is a function. Some inputs recur. Cache the outputs. Ship it.
tl;dr
The pattern. Teams pay for fresh LLM inference on every query even when 30–50% of those queries are semantically identical to ones already answered, because caching feels like it defeats the purpose of using AI.

The fix. Layer exact-match caching first, then semantic caching at a 0.97 cosine similarity threshold, lowering it gradually while monitoring quality.

The outcome. API costs drop 40–60%, latency for cached queries falls from seconds to milliseconds, and users get more consistent answers.