№ 44 evals Jan 23, 2026 · 10 min read

Eval-driven development

Write the eval before you write the prompt. Run the eval before you ship the feature. Re-run the eval before you deploy the change. Evals are the tests of the AI era.


There is a workflow that most AI teams converge on eventually. The ones that converge on it early ship better products. The ones that converge on it late have a painful six months first.

The workflow is this: write the eval before you write the prompt. It is test-driven development for AI systems, and it is the single most important practice we recommend to teams building with language models.

The problem it solves

Without evals, the development cycle looks like this: write a prompt, try a few examples in the playground, look at the outputs, feel okay about them, ship. Two weeks later, a user reports a bad output. You tweak the prompt. Try the failing example. It works now. Ship. A week later, a different user reports a different bad output. Repeat.

This cycle has two problems. First, you are testing in production. Your users are your eval suite. They do not enjoy the role. Second, you have no way to know whether a change that fixes one problem breaks another. Every prompt change is a coin flip.

With evals, the cycle becomes: define what success looks like, build a test suite, iterate until the tests pass, ship. When something breaks in production, add it to the suite. Run the suite before every deploy. You still have production failures — but each one makes the system permanently better, because it becomes a test case that can never silently recur.

Write the eval first

This is the part teams resist. It feels backwards. “How can I write tests before I know what the system will do?”

You write the tests because you need to define what “working” means before you start building. This forces clarity. Instead of “the chatbot should be helpful,” you write concrete test cases:

  • Input: “What is your return policy?” Expected: Response mentions 30-day window. Response mentions the requirement for original packaging. Response does not mention competitor policies.
  • Input: “Can I return a used item?” Expected: Response clearly states that used items cannot be returned. Response suggests contacting support for exceptions.
  • Input: “How are you feeling today?” Expected: Response redirects to product-related topics. Response does not engage in personal conversation.
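Cases like these can live as plain data that any runner can load. A minimal sketch — the `EvalCase` structure and `check` helper are illustrative, not from a particular eval framework:

```python
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    """One test case: an input plus phrases the output must / must not contain."""
    input: str
    must_mention: list = field(default_factory=list)
    must_not_mention: list = field(default_factory=list)

CASES = [
    EvalCase(
        input="What is your return policy?",
        must_mention=["30-day", "original packaging"],
        must_not_mention=["competitor"],
    ),
    EvalCase(
        input="Can I return a used item?",
        must_mention=["cannot be returned", "contact support"],
    ),
]

def check(case: EvalCase, response: str) -> bool:
    """Pass if every required phrase appears and no forbidden phrase does."""
    text = response.lower()
    return (all(p.lower() in text for p in case.must_mention)
            and not any(p.lower() in text for p in case.must_not_mention))
```

Keeping cases as data rather than code means anyone on the team — including non-engineers — can add one.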

These test cases are imperfect. They are incomplete. They do not cover every edge case. That is fine. They cover the cases you know about, and they define a floor for behavior. The floor rises over time as you add more cases.

The act of writing the eval also surfaces design questions early. “What should the system do when asked about competitor products?” If you do not decide before building, you will discover the question in production — when a user screenshots a bad answer and posts it on Twitter.

The eval suite structure

A practical eval suite has three layers:

Deterministic checks. These are non-negotiable behaviors that can be verified programmatically. The output must be valid JSON. The output must not contain PII. The output must be under 500 tokens. The output must be in the specified language. These are cheap to run, fast to evaluate, and should never fail.
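Each of these gates is a few lines. A sketch, assuming the output arrives as a raw string; the character budget stands in for a real token count, and the PII patterns are deliberately crude examples, not a complete detector:

```python
import json
import re

MAX_CHARS = 2000  # stand-in for a 500-token budget; use a real tokenizer in practice
# Crude PII patterns (emails, US-style SSNs) - illustrative, not exhaustive.
PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
]

def deterministic_checks(output: str) -> list:
    """Return the names of failed checks; an empty list means all gates passed."""
    failures = []
    try:
        json.loads(output)  # JSONDecodeError subclasses ValueError
    except ValueError:
        failures.append("valid_json")
    if len(output) > MAX_CHARS:
        failures.append("length")
    if any(p.search(output) for p in PII_PATTERNS):
        failures.append("pii")
    return failures
```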

Semantic checks. These verify that the output contains or avoids specific content. “Response mentions the 30-day return window.” “Response does not include pricing information.” These can be checked with string matching, keyword detection, or — for fuzzier criteria — LLM-as-judge.

Quality checks. These assess the overall quality of the response against criteria like accuracy, helpfulness, tone, and completeness. These are almost always evaluated with LLM-as-judge or human review. They are the most expensive layer but also the most informative.
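A judge layer can be as simple as a rubric prompt plus a verdict parser. A sketch with the model call abstracted behind a `judge` callable — the rubric wording and the PASS/FAIL protocol are illustrative assumptions, not a prescribed format:

```python
RUBRIC = """You are grading a support-bot reply.
Criteria: accurate, helpful, on-topic, appropriate tone.
Reply with exactly PASS or FAIL, then one sentence of reasoning.

User question: {question}
Bot reply: {reply}"""

def quality_check(question: str, reply: str, judge) -> bool:
    """Grade a reply with a judge model; `judge` is any prompt -> text callable."""
    verdict = judge(RUBRIC.format(question=question, reply=reply))
    return verdict.strip().upper().startswith("PASS")
```

Injecting the judge as a callable keeps the layer testable: in CI it wraps a real API call, in unit tests it can be a stub.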

Not every test case needs all three layers. Start with deterministic checks for structural requirements and semantic checks for content requirements. Add quality checks for your most important use cases.

The daily workflow

Here is what eval-driven development looks like day to day:

Morning. Developer picks up a task — maybe a new feature, maybe a bug fix, maybe a prompt improvement. Before touching the prompt, they write 3-5 new eval cases that define what the change should accomplish.

Midday. Developer iterates on the prompt. After each change, they run the eval suite. The suite includes the new cases plus all existing cases. They watch for two things: do the new cases pass? Do any existing cases break?

Afternoon. The eval suite passes. The developer opens a pull request. The PR includes the prompt change and the new eval cases. The reviewer can see exactly what behavior the change is supposed to produce and verify that the eval cases are reasonable.

Deployment. CI runs the full eval suite against the changed prompt. If the suite passes, the change is deployed. If it fails, the deploy is blocked. The developer is notified and investigates.
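The gate itself can be a short script that exits nonzero when the pass rate drops below a threshold — a nonzero exit code is all most CI systems need to block a merge. The threshold and structure here are illustrative:

```python
import sys

PASS_THRESHOLD = 0.95  # illustrative bar; tune per suite

def gate(results: list) -> int:
    """Return a process exit code for CI: 0 allows the deploy, 1 blocks it."""
    passed = sum(results)
    rate = passed / len(results)
    print(f"{passed}/{len(results)} cases passed ({rate:.0%})")
    return 0 if rate >= PASS_THRESHOLD else 1

# In CI, `results` would be the booleans produced by running every case
# through the suite's checks, and the script would end with:
#   sys.exit(gate(results))
```

Gating on a score threshold rather than requiring 100% leaves room for the occasional flaky judge call while still catching real regressions.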

This workflow is slower than “edit and ship” on day one. By month three, it is faster. The team spends less time debugging production issues, less time reverting bad changes, less time explaining to stakeholders why the AI said something it should not have said.

The eval suite as artifact

Over time, the eval suite becomes the team’s most important artifact. More important than the prompt. More important than the model selection. More important than the architecture.

Here is why: the eval suite encodes what “working” means. It is the cumulative knowledge of every failure, every edge case, every design decision. A new team member can read the eval suite and understand the system’s intended behavior faster than they can read the code.

When you switch models — and you will — the eval suite tells you whether the new model meets the bar. When you rewrite the prompt — and you will — the eval suite tells you whether the rewrite preserved the behaviors that matter. When you redesign the pipeline — and you will — the eval suite is the constant.

Prompts are ephemeral. Models are ephemeral. The eval suite is the thing that persists.

The economics

Teams that adopt eval-driven development report a consistent pattern:

  • Week 1-2. Slower. Writing evals takes time. The team feels like they are over-investing in testing.
  • Month 1. Neutral. The eval suite catches a few regressions that would have been production incidents. Time saved on debugging roughly offsets time spent on eval writing.
  • Month 3. Faster. The eval suite is mature enough that prompt changes can be made confidently. The team iterates faster because they know immediately whether a change works. Production incidents drop.
  • Month 6. Significantly faster. The eval suite is comprehensive. New features are built against existing eval infrastructure. Onboarding new team members is faster because the eval suite serves as documentation.

The teams that never adopt evals stay in the “edit, ship, pray” loop. They ship about as fast in month six as they did in month one — but they spend an increasing share of their time on firefighting.

Common objections

“Evals are expensive.” A 100-case eval suite costs $5-15 per run with LLM-as-judge. You run it a few times a day during development. That is $20-60 per day. Your production AI spend is orders of magnitude higher. The eval cost is the cheapest insurance you will buy.

“LLM-as-judge is unreliable.” It is imperfect. It has a 5-10% error rate on nuanced judgments. But you are not asking it for nuance. You are asking it for gross failures — did the response mention the return policy or not? At that level, LLM-as-judge is quite reliable. Use deterministic checks where you can. Use LLM-as-judge for the rest.

“We don’t know what all the edge cases are.” You do not need to. Start with the cases you know. Add production failures as they occur. The suite grows organically. Perfection is not the goal. Coverage is.

The heuristic

Write the eval before you write the prompt. Every production failure becomes a new eval case. Run the suite before every deploy. The eval suite is the artifact that compounds — it gets more valuable with every failure it encodes and every regression it catches. If you build one thing well on your AI team, build the eval suite.

tl;dr

The pattern. AI teams build the feature first, demo it, ship it, and only think about evals after the first production incident — which means they spend months in an “edit, ship, pray” loop where every prompt change might silently break a behavior they fixed last week.

The fix. Write the eval before you write the prompt: define what “correct” means, build 20–50 test cases, and block deploys in CI when the suite score drops below threshold.

The outcome. The team ships slower in week one and significantly faster by month three, because every change is made against a growing specification of intended behavior rather than into the dark.

