The latency budget your PM forgot
Your product spec says “fast.” Let’s do the math.
Your LLM call takes 3 seconds. Your retrieval takes 800ms. Your reranker takes 400ms. Your post-processing — guardrails, formatting, logging — takes another 200ms. You are at 4.4 seconds before any business logic, before any database writes, before the response even starts rendering in the UI.
Your PM’s mental model is a web app. Click a button, see a result. 200ms feels fast. 500ms feels acceptable. Anything over a second feels slow. They specced the feature assuming the latency profile of a REST API. You are building something with the latency profile of a batch job.
Nobody talked about this at kickoff. Now it’s week 6 and the feature works but nobody wants to use it because it takes 5 seconds.
The components
Here’s where the time goes in a typical RAG-powered feature:
Embedding the query: 50-100ms. This is the cheap one. People rarely worry about this, and they’re right not to.
Retrieval: 200-800ms. Depends on your vector database, your index size, and how much filtering you’re doing. Most managed vector databases land around 200-400ms for a simple query. Add metadata filtering and it climbs. Add hybrid search — vector plus keyword — and you’re north of 500ms.
Reranking: 200-600ms. If you’re using a cross-encoder reranker — and you should be, the quality improvement is real — you’re adding another few hundred milliseconds. The latency scales with the number of candidates you rerank. Rerank 20 chunks and it’s fast. Rerank 100 and it’s not.
LLM call: 1-8 seconds. This is the dominant cost. It depends on the model, the prompt length, and the output length. GPT-4-class models are 2-5 seconds for a typical completion. Smaller models are faster but less capable. Streaming helps the perceived latency but doesn’t reduce time-to-last-token.
Post-processing: 100-500ms. Guardrails, output validation, structured extraction, logging, writing to a database. Each step is small. They add up.
Network overhead: 100-300ms. Round trips to external services, TLS handshakes, DNS lookups. If your vector database is in a different region than your compute, add more.
Total: 2-10 seconds for a single turn. And that’s the happy path — no retries, no fallbacks, no model timeouts.
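Before you can budget these components, you have to measure them. A minimal sketch of per-stage instrumentation — the `stage` helper and the stage names are illustrative, and the `time.sleep` calls stand in for real embed/retrieve/LLM calls:

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def stage(name: str):
    """Record wall-clock time for one pipeline stage, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = (time.perf_counter() - start) * 1000

# Hypothetical pipeline — replace the sleeps with your real calls.
with stage("embed"):
    time.sleep(0.01)
with stage("retrieve"):
    time.sleep(0.02)
with stage("llm"):
    time.sleep(0.05)

total_ms = sum(timings.values())
print({k: round(v) for k, v in timings.items()}, f"total={total_ms:.0f}ms")
```

Logging these per stage on every request is what turns "it feels slow" into "retrieval regressed by 300ms last Tuesday."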
Why PMs don’t think about this
Product managers think in user stories, not in architecture diagrams. When a PM writes “user asks a question, system returns an answer,” the implicit assumption is that the answer appears quickly. They’re pattern-matching against search — type a query, see results. Google does it in 400ms. How hard can it be?
The gap is that nobody sits down with the PM and says: here is the latency budget for this feature. Here is what each component costs in wall-clock time. Here is the total. Do you still want to build it this way?
This conversation should happen at spec time, not at demo time. But it almost never does, because at spec time the engineering team hasn’t built the thing yet and doesn’t have concrete numbers. So they say “it should be fine” and move on. By the time they have numbers, the feature is built and the only question is how to make it faster — not whether the approach was right in the first place.
The latency budget
A latency budget is exactly what it sounds like: a breakdown of how much time each component gets, summing to a total that the user experience can tolerate.
Here’s an example for a conversational RAG feature with a 3-second target:
| Component | Budget | Notes |
|---|---|---|
| Query embedding | 80ms | Mostly fixed |
| Retrieval | 300ms | Requires index tuning |
| Reranking | 250ms | Limits candidate count to 20 |
| LLM (TTFT) | 1500ms | Streaming, time to first token |
| LLM (full) | 2500ms | Streaming hides this |
| Post-processing | 200ms | Async where possible |
| Network | 170ms | Co-locate services |
| Total (perceived) | 2300ms | With streaming |
| Total (actual) | 3500ms | Full completion |
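The table's arithmetic is worth encoding as a check, so the budget and the spec can't silently drift apart. A sketch using the numbers above — the dictionary keys are illustrative names, not an API:

```python
# Per-component budgets in milliseconds, taken from the table above.
budget_ms = {
    "embed": 80,
    "retrieve": 300,
    "rerank": 250,
    "llm_ttft": 1500,   # time to first token
    "llm_full": 2500,   # time to last token
    "post": 200,        # async, so off the perceived critical path
    "network": 170,
}

pre_llm = sum(budget_ms[k] for k in ("embed", "retrieve", "rerank", "network"))

# Perceived: user sees tokens at TTFT; async post-processing doesn't block.
perceived = pre_llm + budget_ms["llm_ttft"]

# Actual: full completion plus post-processing.
actual = pre_llm + budget_ms["llm_full"] + budget_ms["post"]

print(f"perceived={perceived}ms, actual={actual}ms")  # 2300ms / 3500ms
```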
The perceived latency — the time until the user sees something happening — is 2.3 seconds because streaming starts delivering tokens after the time-to-first-token. The actual latency is 3.5 seconds for the full response. This is the difference between “feels responsive” and “feels slow” even though the underlying work is identical.
Notice that the budget forces design decisions. Reranking is capped at 20 candidates, which means retrieval needs to return high-quality results in the first pass. Post-processing must be async where possible — log writes and analytics don’t block the response. Services must be co-located to keep network overhead low.
These are engineering decisions driven by a latency budget. Without the budget, you make these decisions reactively — after the thing is too slow — instead of proactively.
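"Async where possible" has a concrete shape. A minimal sketch, assuming an asyncio service — `write_log` is a placeholder for whatever log or analytics write you'd otherwise block on:

```python
import asyncio

async def write_log(entry: dict) -> None:
    """Placeholder for a real log/analytics write (~200ms)."""
    await asyncio.sleep(0.2)

async def handle_request(answer: str) -> str:
    # Fire-and-forget: schedule the write, return without awaiting it.
    # Keep a reference so the task isn't garbage-collected mid-flight.
    task = asyncio.create_task(write_log({"answer": answer}))
    return answer  # the user doesn't pay the 200ms

result = asyncio.run(handle_request("forty-two"))
```

In a real long-running service the event loop stays alive and the task completes after the response is sent; here `asyncio.run` tears the loop down immediately, so the pending write is cancelled — acceptable for a sketch, not for production, where you'd track in-flight tasks and drain them on shutdown.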
UX patterns that buy time
When your latency budget exceeds what a synchronous interaction can tolerate, you have three options. Most teams reach for the first and ignore the other two.
Streaming. Stream the LLM output token by token. This is table stakes now. It drops perceived latency from time-to-last-token to time-to-first-token, which is typically 500-1500ms faster. But streaming doesn’t help with the pre-LLM latency — retrieval and reranking still block.
Progressive loading. Show intermediate results as they become available. Show the retrieved sources before the LLM response. Show a skeleton of the answer before it’s complete. Show confidence indicators that update as more context is processed. This is more work than streaming but it transforms a 4-second wait into a 4-second experience where things are visibly happening.
Async processing. Not every AI interaction needs to be synchronous. If the user is submitting a document for analysis, the result can arrive in a notification. If the user is requesting a report, it can be emailed. The UX should match the latency, not the other way around. A 30-second generation is unbearable as a synchronous wait and perfectly fine as “we’ll notify you when it’s ready.”
The choice depends on the use case. Chat interfaces need streaming. Search interfaces need progressive loading. Document processing can be async. The mistake is assuming everything must be synchronous because the PM specced it as a button click.
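The streaming claim — perceived latency drops from time-to-last-token to time-to-first-token — is easy to see in code. A sketch with a fake token generator standing in for a streaming LLM client; the timings are illustrative:

```python
import time
from typing import Iterator

def fake_llm_stream() -> Iterator[str]:
    """Stand-in for a streaming LLM client (timings are made up)."""
    time.sleep(0.05)          # time to first token
    yield "Hello"
    for tok in [",", " world", "!"]:
        time.sleep(0.01)      # inter-token latency
        yield tok

start = time.perf_counter()
first_token_at = None
chunks = []
for tok in fake_llm_stream():
    if first_token_at is None:
        # This is what the user perceives as "the answer started".
        first_token_at = time.perf_counter() - start
    chunks.append(tok)
total = time.perf_counter() - start  # time to last token

print(f"TTFT={first_token_at*1000:.0f}ms, full={total*1000:.0f}ms")
```

The gap between `first_token_at` and `total` is exactly the latency that streaming hides — and nothing here touches the retrieval and reranking time that runs before the first token exists.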
The conversation to have
Before you write a line of code, have this conversation with your PM:
- Here is the latency budget for this feature. Here is what each component costs.
- The total is N seconds. Here is what the user will experience.
- Given that latency, here are the UX options: streaming, progressive loading, async.
- Which of these is acceptable? That determines how we build it.
If the PM says “none of those are acceptable, it needs to be under 500ms” — great, now you know this feature requires a fundamentally different architecture. Maybe you pre-compute. Maybe you use a smaller model. Maybe you cache aggressively. But you know that before you build, not after.
The heuristic
Every AI feature needs a latency budget before it has a product spec. Add up the components. Show the total to your PM. If the number is uncomfortable, redesign the UX or redesign the architecture — but don’t pretend the number is going to be different when you ship.
tl;dr
- The pattern: PMs spec AI features with a web-app mental model, nobody does the latency math at kickoff, and the team discovers at week 6 that retrieval plus reranking plus an LLM call adds up to 5 seconds before any business logic runs.
- The fix: Before writing a line of code, build a latency budget that breaks down each component’s wall-clock cost, show the total to your PM, and choose the appropriate UX pattern — streaming, progressive loading, or async — based on what the number actually is.
- The outcome: Architecture decisions get made proactively during design instead of reactively after users complain that the feature is too slow to use.