Regression suites for prompts
Every prompt change is a potential regression. If you do not have a test suite that runs before every prompt deployment, you are testing in production.
You changed three words in a system prompt. The feature that was broken is now fixed. You deploy. The next morning, a different feature is broken — one you did not touch, did not test, did not even think about.
This is the normal state of prompt engineering without regression testing. Every prompt change is a blind trade. You fix one behavior and break another, and you do not find out until a user tells you.
Prompts are code
We treat prompts like copy. We edit them in a text box. We eyeball a few examples and ship. We do not test them. We do not version them rigorously. We do not run them through CI.
This is a mistake. Prompts are the control plane for model behavior. A prompt change is a behavior change. And behavior changes need tests — the same way code changes need tests.
The difference is that prompt behavior is non-deterministic. The same prompt can produce different outputs on successive runs. This makes testing harder, but it does not make testing optional. It makes testing more important, because you cannot rely on manual spot-checking to catch regressions.
What a regression suite looks like
A regression suite for prompts is a set of input/output pairs where you know the expected behavior. Not the exact expected output — the expected behavior.
Each test case has three parts:
- Input. The user query or the full prompt template with variables filled in.
- Expected behavior. Not the exact string, but a description of what the output should do. “Should mention the return policy.” “Should not include pricing.” “Should respond in Spanish.” “Should decline to answer.”
- Assertion. A function that checks whether the output meets the expected behavior. This can be a string match, a regex, an LLM-as-judge call, or a human review — depending on how precise the behavior is.
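The three parts above map naturally onto a small data structure. A minimal sketch in Python, with illustrative cases and assertion styles (the `PromptTestCase` name and the example queries are assumptions, not from any particular framework):

```python
import re
from dataclasses import dataclass
from typing import Callable

@dataclass
class PromptTestCase:
    input: str                    # user query or filled-in prompt template
    expected_behavior: str        # human-readable description of the behavior
    check: Callable[[str], bool]  # assertion over the model's output

cases = [
    PromptTestCase(
        input="What is your return policy?",
        expected_behavior="Should mention the return policy",
        # simple string-match assertion
        check=lambda out: "return" in out.lower(),
    ),
    PromptTestCase(
        input="How much does the pro plan cost?",
        expected_behavior="Should not include pricing",
        # regex assertion: fail if a dollar amount appears
        check=lambda out: not re.search(r"\$\d", out),
    ),
]
```

The `check` field is deliberately just a function, so the same structure holds a string match, a regex, or an LLM-as-judge call.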
Start with 20 test cases. That is enough to catch gross regressions. You do not need 500 on day one. You need 20 that cover the most important behaviors.
Where the test cases come from
The first 20 are easy. You already know them — they are the queries you manually tested when you built the feature. The ones you pasted into the playground. Write them down. Add the expected behavior. That is your initial suite.
After that, every production failure becomes a test case. User reports a bad answer. You investigate. You fix the prompt. You add the failing query to the suite with the correct expected behavior. Now that failure can never recur silently.
This is the key insight: your regression suite is a record of every lesson learned. It encodes institutional knowledge about what the system should and should not do. Six months in, your suite is the most valuable artifact on the team — more valuable than the prompt itself, because the suite defines what the prompt is supposed to achieve.
The workflow
Here is how it works in practice:
- Developer wants to change a prompt.
- Developer makes the change locally.
- Developer runs the regression suite against the changed prompt. This takes 2-10 minutes depending on suite size and model latency.
- Suite passes — the change did not break any known behavior. Deploy.
- Suite fails — the change broke something. Developer fixes the prompt or updates the test case if the old behavior was wrong.
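The run-the-suite step reduces to a short loop. A sketch, assuming cases are (input, expected behavior, assertion) triples and `run_prompt` is whatever function wraps your model call for the prompt under test — it is injected here so the runner itself has no API dependency:

```python
def run_suite(cases, run_prompt):
    """Run every case through the prompt under test; return the failures.

    cases: list of (input, expected_behavior, check) triples.
    run_prompt: maps a test input to the model's output text.
    """
    failures = []
    for text, behavior, check in cases:
        output = run_prompt(text)
        if not check(output):
            failures.append((text, behavior, output))
    return failures

# Gate the deploy on an empty failure list:
#   failures = run_suite(cases, run_prompt=call_your_model)
#   assert not failures, f"{len(failures)} regression(s)"
```

Returning the failing triples, rather than just a pass/fail bit, matters in practice: the developer needs to see which behavior broke and what the output actually was.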
This is not revolutionary. It is test-driven development applied to prompts. The only novel part is that assertions are fuzzier — you are checking behavior, not exact output.
LLM-as-judge assertions
For many test cases, the assertion is hard to write as a regex or string match. “The response should be helpful and accurate” is not something you can check with contains().
This is where LLM-as-judge works well. Use a separate model call — ideally a different model from the one being tested — to evaluate whether the output meets the expected behavior. The judge prompt is simple: “Given this input and this expected behavior, does this output meet the criteria? Respond yes or no.”
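Wrapped as an assertion, the judge fits the same check-function shape as the deterministic cases. A sketch, assuming `call_judge` is a function you supply that sends a prompt to the judge model and returns its text response:

```python
def judge_check(expected_behavior, call_judge):
    """Return an assertion backed by an LLM judge.

    call_judge: sends a prompt to a separate judge model and returns its
    text response — wire in your own client here (assumed, not shown).
    """
    def check(output):
        prompt = (
            f"Expected behavior: {expected_behavior}\n"
            f"Output to evaluate: {output}\n"
            "Does the output meet the expected behavior? Respond yes or no."
        )
        # Parse leniently: accept "Yes", "yes.", "Yes, because..."
        verdict = call_judge(prompt).strip().lower()
        return verdict.startswith("yes")
    return check
```

The lenient yes/no parsing is deliberate: judge models often elaborate, and a brittle exact-match on the verdict would turn judge verbosity into spurious test failures.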
LLM-as-judge is not perfect. It has a ~5-10% error rate on nuanced judgments. But it is good enough for regression testing, where you are looking for gross failures, not subtle quality differences. And it is vastly better than no testing at all.
For critical behaviors — safety, compliance, factual accuracy — use deterministic assertions where possible. Reserve LLM-as-judge for softer criteria like tone, helpfulness, and completeness.
The cost objection
“Running the suite costs money. Every test case is an API call — sometimes two, if we’re using LLM-as-judge.”
Yes. A 50-case suite with LLM-as-judge costs maybe $2-5 per run. You run it a few times a day during development. That is $10-20 per day. Your production AI spend is probably $500-5000 per day.
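The arithmetic behind those figures is straightforward. A back-of-envelope sketch — the per-call cost is an assumed average and varies widely by model and prompt length:

```python
cases = 50
calls_per_case = 2     # one call to the prompt under test, one to the judge
cost_per_call = 0.03   # assumed average dollars per call; varies by model
runs_per_day = 4       # a few runs during active development

cost_per_run = cases * calls_per_case * cost_per_call   # 3.0 dollars
daily_cost = cost_per_run * runs_per_day                # 12.0 dollars
```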
The cost of the regression suite is a rounding error compared to the cost of a production regression that serves bad answers to real users for hours before someone notices.
Growing the suite
The suite should grow monotonically. You add cases. You almost never remove them.
When a production failure occurs, add a test case before you fix the prompt. This is the same discipline as writing a failing test before fixing a bug. It proves the test catches the failure. Then fix the prompt. The test passes. Ship it.
Over time, you will notice clusters. Certain categories of queries are fragile — they break more often. These clusters tell you where the prompt is weakest and where to invest in improvements.
A healthy suite grows by 2-5 cases per week. After six months, you have 100-200 cases. After a year, 200-400. At that point, the suite is a comprehensive specification of your system’s behavior. New team members can read the suite and understand what the system does faster than they can read the code.
The heuristic
If you are deploying prompt changes without running a regression suite, you are testing in production. Your users are your test suite. They are not good at it, and they do not enjoy it.
Start with 20 cases. Add every failure. Run it before every deploy. This is the minimum viable practice for professional prompt engineering.
tl;dr
The pattern. Teams change a prompt to fix one behavior, ship it, and discover the next morning that a different behavior broke — because prompts are treated like copy instead of code and deployed without any regression testing.
The fix. Build a suite of input/expected-behavior pairs, add every production failure as a new test case, and run the suite before every prompt deploy.
The outcome. Prompt changes stop being blind trades, institutional knowledge about what the system should do compounds into the suite, and production regressions become rare instead of routine.