If you can't eval it, don't ship it
Evals are not the thing you add after launch. They are the thing that tells you whether launching is a good idea.
Every AI feature we’ve seen regress in production had one thing in common: it shipped without an eval suite. The team planned to “add evals later.” Later never came — or it came after the second incident, by which time nobody trusted the system and the fix was political, not technical.
This is the pattern. It is extremely common. And it is fixable, if you flip the order.
“We’ll add evals later”
Software engineers learned this lesson 20 years ago with unit tests. “We’ll write tests later” meant “we’ll never write tests.” The industry developed TDD, CI gates, coverage thresholds — not because testing is fun, but because the cost of not testing compounds silently until something breaks in production.
AI systems are the same problem, but worse. A traditional bug crashes. A log line appears. Someone gets paged. An AI regression does none of those things. The model returns a plausible-looking wrong answer. The user sees it. Maybe they notice, maybe they don’t. Your dashboards stay green. Your error rate is zero. Your system is confidently wrong and nobody knows.
This is why “we’ll add evals later” is more dangerous than “we’ll write tests later.” Tests catch failures that announce themselves. Evals catch failures that don’t.
The order is wrong
Most teams we work with build in this order:
1. Build the feature
2. Demo it to stakeholders
3. Ship it
4. Get a bug report
5. Panic
6. Build an eval to prove the fix works
The order should be:
1. Define what “correct” means for this feature
2. Build an eval that measures it
3. Build the feature
4. Run the eval
5. Ship when the eval passes
6. Run the eval on every deploy
Step 1 is the hardest part. It forces you to answer questions you’d rather defer. What does a good answer look like? How wrong is too wrong? What are the edge cases? If you cannot answer these questions, you are not ready to build the feature — you just don’t know it yet.
What a minimal eval suite looks like
You do not need a research-grade evaluation framework. You need three things.
A golden set. 50–100 input-output pairs where you know the correct answer. For a RAG system, these are questions paired with the documents that contain the answers. For a classification agent, these are inputs paired with correct labels. For a generation system, these are prompts paired with reference outputs and a rubric. Building this takes one to two days. They are the highest-leverage days your team will spend.
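A golden set needs no special tooling to start — a list of verified pairs in a file is enough. Here is a minimal sketch for the classification case; the inputs and labels are hypothetical, and in practice you would draw them from real traffic:

```python
# Hypothetical golden set for a support-ticket classifier.
# Each item pairs a real input with a human-verified label.
GOLDEN_SET = [
    {"input": "I was charged twice for my subscription", "expected": "billing"},
    {"input": "The app crashes when I upload a photo",   "expected": "bug"},
    {"input": "How do I export my data to CSV?",         "expected": "how_to"},
    # ...grow this to 50-100 items drawn from production traffic
]
```

Checking the items into version control next to the code keeps the golden set reviewable: a PR that changes what “correct” means is visible as a diff, not a silent edit to a spreadsheet.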
A scoring function. Something that takes a system output and a reference answer and returns a number. This can be exact match. It can be cosine similarity. It can be an LLM-as-judge with a rubric. It does not need to be perfect. It needs to be consistent enough to catch regressions.
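For the exact-match case, the scoring function can be a few lines. This is a sketch, not a prescription — swap in similarity or an LLM judge as your outputs demand:

```python
def score_exact(output: str, reference: str) -> float:
    """1.0 if the whitespace- and case-normalized output matches the reference."""
    normalize = lambda s: " ".join(s.lower().split())
    return 1.0 if normalize(output) == normalize(reference) else 0.0

def suite_score(outputs: list[str], references: list[str]) -> float:
    """Mean score across the golden set; this is the number the CI gate watches."""
    scores = [score_exact(o, r) for o, r in zip(outputs, references)]
    return sum(scores) / len(scores)
```

The normalization step matters more than it looks: without it, a trailing newline from the model reads as a regression.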
A CI gate. The eval runs on every PR that touches the retrieval pipeline, the prompt, or the model config. If the score drops below a threshold, the PR does not merge. This is the part that actually prevents regressions. Without it, the golden set is just a spreadsheet someone checks once a quarter.
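One way to wire the gate is a script whose exit code the CI job consumes. Everything below the function is a hypothetical stand-in (a fake system, a toy golden set) purely to make the sketch runnable; in your pipeline you would plug in the real system call and scorer:

```python
def ci_gate(golden_set, run_system, score_fn, threshold=0.85) -> int:
    """Run the eval and return 0 (pass) or 1 (fail), suitable as a process exit code."""
    scores = [score_fn(run_system(item["input"]), item["expected"])
              for item in golden_set]
    mean = sum(scores) / len(scores)
    print(f"eval score: {mean:.2f} (threshold {threshold})")
    return 0 if mean >= threshold else 1

# Hypothetical stand-ins for illustration only.
golden = [{"input": "ping",  "expected": "pong"},
          {"input": "hello", "expected": "world"}]
fake_system = lambda q: "pong" if q == "ping" else "??"
exact = lambda out, ref: float(out == ref)

status = ci_gate(golden, fake_system, exact, threshold=0.85)
# In CI: sys.exit(status) -- a nonzero exit blocks the merge.
```

The threshold belongs in the repo, not in someone’s head: committing it makes every relaxation of the bar a reviewed change.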
That is it. Golden set, scoring function, CI gate. You can build this in a week. You can build a rough version in a day.
The failure modes we keep seeing
“Our eval is a vibe check.” Someone on the team runs 10 queries manually and says “looks good.” This catches nothing. It is not repeatable. It does not run in CI. It is a ritual, not a test.
“Our eval is too expensive to run on every deploy.” Then make it cheaper. Subsample your golden set. Use a faster model for scoring. Run the full suite nightly and a smoke test on every PR. The constraint is not cost. The constraint is that you have not decided to prioritize it.
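Subsampling for the per-PR smoke test can be a one-liner, as long as it is deterministic — a fixed seed keeps the subset stable across runs so scores stay comparable. A minimal sketch:

```python
import random

def smoke_subset(golden_set: list, k: int = 10, seed: int = 0) -> list:
    """Deterministic subsample of the golden set for fast per-PR runs.
    The fixed seed means every CI run scores the same items."""
    rng = random.Random(seed)
    return rng.sample(golden_set, min(k, len(golden_set)))

full = [{"id": i} for i in range(100)]   # stand-in for the full golden set
smoke = smoke_subset(full, k=10)
```

The full suite then runs nightly against the same scoring function, so a regression the smoke test misses still surfaces within a day.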
“We don’t know what correct looks like.” This is the most honest version. And it means you are not ready to ship the feature. If you cannot define correct, you cannot measure it. If you cannot measure it, you cannot know whether your next deploy made it better or worse. You are flying blind. That is fine in a prototype. It is not fine in production.
“Our system is too creative to eval.” No it isn’t. Even creative outputs have properties you can measure — factual accuracy, format compliance, toxicity, length, presence of required information. You are not evaluating whether the output is beautiful. You are evaluating whether it is broken.
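Those property checks are mechanical. A sketch for a feature that promises JSON with certain fields — the field names and length limit here are hypothetical placeholders for whatever your feature actually guarantees:

```python
import json

def check_properties(output: str,
                     required_fields=("summary", "next_steps"),
                     max_chars: int = 2000) -> dict:
    """Mechanical checks on a 'creative' output: format compliance,
    required information, length. Not a judgment of quality."""
    checks = {}
    try:
        data = json.loads(output)
        checks["valid_json"] = True
        checks["has_required_fields"] = all(f in data for f in required_fields)
    except json.JSONDecodeError:
        checks["valid_json"] = False
        checks["has_required_fields"] = False
    checks["within_length"] = len(output) <= max_chars
    return checks

result = check_properties('{"summary": "ok", "next_steps": "retry upload"}')
```

Each check is binary and cheap, which means all of them can run on every item of the golden set without touching the model again.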
The heuristic
If you can’t eval it, you can’t ship it. If you can’t re-eval it on every deploy, you can’t maintain it.
This sounds strict. It is. AI systems degrade in ways that are invisible until they are expensive. A prompt change that improves one class of queries and silently breaks another. An embedding model update that shifts your vector space. A chunking change that drops context your users depend on. None of these will page anyone. All of them will erode trust.
The eval suite is not overhead. It is the only thing standing between you and a system that is getting worse and you don’t know it.
We have seen this pattern dozens of times. The teams that build evals first ship slower in week one and faster in month three. The teams that skip evals ship fast, then spend a quarter rebuilding trust — with their users, with their stakeholders, and with themselves.
Build the eval first. Then build the feature. The order matters.
tl;dr
- The pattern: teams ship AI features without evals because they plan to “add them later,” which means they never get added until after the second production incident — by which point the system has eroded user trust and the fix is political as much as technical.
- The fix: before writing the prompt or building the pipeline, define what “correct” means, assemble a 50–100 item golden set, write a scoring function, and wire a CI gate that blocks merges when the score drops below threshold.
- The outcome: regressions from prompt changes, embedding model swaps, and chunking updates get caught in the pull request rather than discovered by users, and the team can ship changes confidently instead of shipping them hopefully.