№ 28 evals May 23, 2025 · 10 min read

Your test suite passed. Your system is still broken.

A passing test suite for an AI system is necessary but dangerously insufficient. The failures that hurt you are the ones your test suite was not designed to catch.

A passing test suite for an AI system tells you one thing: the known scenarios still work. It tells you nothing about the unknown ones. And with AI systems, the unknown scenarios are where the failures live.

The green checkmark problem

Traditional software has a useful property: it is deterministic. Given the same input, it produces the same output. Your test suite verifies this contract. If the tests pass, the contract holds. You can deploy with confidence.

AI systems do not have this property. The same input can produce different outputs. The model’s behavior changes with temperature, with context window contents, with the phase of the moon — or more precisely, with the random seed, the batching order, and whatever the provider changed in their last silent update.

Your test suite checks the scenarios you thought of. It passes. You deploy. Then a user sends a query you did not think of — phrased in a way your test cases do not cover, referencing a topic your eval set does not include — and the system fails. Not with an error. With a confident, plausible, wrong answer.

This is worse than a crash. A crash is visible. A wrong answer is invisible until someone notices.

Why traditional testing is not enough

A traditional test suite for a deterministic system is a contract verification tool. You define the expected behavior, you assert against it, you move on. The surface area is bounded — you can enumerate the states, or at least the important ones.

An AI system’s surface area is unbounded. The input space is natural language — every possible sentence, in every possible context, with every possible intent. You cannot enumerate it. You can sample it, but your samples are biased by your own imagination.

The tests you write reflect the scenarios you can think of. The failures that hurt you are the ones you cannot. This is not a testing problem. It is an epistemological problem. And the solution is not “write more tests” — it is “supplement your tests with mechanisms that find the scenarios you missed.”

The three supplements

We recommend three additions to every AI system’s test suite. None of them are optional.

1. Fuzz testing.

Send random, malformed, adversarial, and unexpected inputs to your system. Not just once — continuously, as part of CI.

This is not novel. Fuzz testing has been standard practice in security engineering for decades. The surprise is how few AI teams do it. They test with well-formed queries from their eval set and call it done.

A basic fuzz test for an AI system:

Random strings. Unicode. Empty inputs. Inputs that are 100,000 characters long.
Inputs in languages your system does not support. Inputs that mix languages mid-sentence.
Inputs that are technically valid but semantically nonsensical.
Inputs that contain prompt injection attempts — “Ignore previous instructions and…”
Inputs that reference your system prompt, your company name, your competitors.

You are not looking for correct answers. You are looking for catastrophic failures — crashes, infinite loops, data leaks, offensive outputs, or responses that reveal system internals. Set your assertions accordingly: the system should not crash, should not leak the system prompt, should not produce output longer than X tokens, should respond within Y seconds.

Run this nightly. Keep the seeds that trigger failures. Add them to your regression set.

2. A regression log.

Every production failure becomes a test case. No exceptions.

When a user reports a bad output, when your monitoring catches a hallucination, when a support ticket mentions the AI giving wrong information — that input-output pair goes into a regression log. The log becomes a test suite. Run it on every deployment.

This sounds obvious. In practice, most teams do not do it. The failure gets fixed — the prompt gets tweaked, the context gets adjusted — but the test case does not get written. Three months later, a different change reintroduces the same failure. Nobody connects the dots.

The regression log is your institutional memory for AI failures. It grows over time. It gets more valuable as it grows. After six months, you have a test suite that reflects the actual failure modes of your system, not the hypothetical ones you imagined at design time.

The mechanics are simple. A shared document or a database table with three columns: input, bad output, expected output. A script that runs every entry against the current system and flags regressions. Integrate it into CI.

3. Periodic red-teaming.

Once a month, someone on your team — or better, someone not on your team — spends a focused session trying to break the system. Not automated testing. Human adversarial testing.

The red-teamer’s job is to find failures that neither the fuzz tests nor the regression log would catch. They bring creativity, domain knowledge, and malicious intent — the combination that produces the most interesting failures.

What a red-team session looks like:

2 hours, focused. One or two people. A shared doc for findings.
Try to make the system contradict itself. Ask the same question two different ways and see if you get conflicting answers.
Try to make the system exceed its authority. Ask it to do things it should not be able to do.
Try to make the system leak information. Ask it about other users, internal processes, system configurations.
Try to make the system produce harmful output. This is uncomfortable but necessary.
Try edge cases specific to your domain. If you are in healthcare, try drug interactions. If you are in finance, try market manipulation scenarios.

Every finding goes into the regression log. The red-team session feeds the automated tests. Over time, the automated tests get better because they are shaped by human adversarial thinking.

The integration

These three supplements are not separate from your test suite — they feed into it. The fuzz tests find crash-level failures, which become regression tests. The regression log captures production failures, which become permanent test cases. The red-team sessions find creative failures, which become both regression tests and new fuzz test patterns.

Your test suite grows in the direction of your actual failure modes, not your imagined ones. After six months, you have a test suite that would have caught 80% of the failures you encountered — because it was built from those failures.

The cost

A fuzz test takes a day to set up and runs on a schedule. The regression log is a process change, not a technical one. The red-team session is 2 hours per month. Total investment: maybe 2 engineer-days per month.

Compare this to the cost of a production failure in an AI system — a hallucinated medical dosage, a leaked customer record, a confidently wrong financial calculation. The math is not close.

The heuristic

A green test suite means your known scenarios work. It says nothing about the unknown ones. Supplement it with three things: fuzz tests for crash-level failures, a regression log for production failures, and a monthly red-team session for creative failures. The test suite you ship with is not the test suite that will protect you. The one that protects you is the one shaped by real failures over time.

tl;dr

The pattern. AI teams ship with a passing test suite that only covers scenarios they imagined, while production failures arrive as confident wrong answers from inputs nobody thought to test. The fix. Supplement your test suite with nightly fuzz tests for crash-level failures, a regression log that converts every production incident into a permanent test case, and a monthly human red-team session. The outcome. After six months, your test suite reflects your system’s actual failure modes rather than your assumptions about them, and you catch regressions before users do.

// co-written with ai · edited by humans

← all field notes Start a retainer →

// related notes

If you can't eval it, don't ship it 10 min
Your agent is a cronjob. Name it that. 7 min
Eval-driven development 10 min