№ 01 · evals · Apr 12, 2024 · 8 min read

The eval you skipped is the one that bites

Teams skip evals on exactly the features that need them most — the ones where “correct” is hard to define. That difficulty is the signal, not the excuse.


Every team we talked to in Q2 2024 was shipping LLM features. Summarization, extraction, chat, search. The race was real — GPT-4 was mature, Claude 3 had just landed, and the window to build something differentiated was shrinking by the week.

Almost none of them had evals.

Not “not enough evals.” Not “evals that weren’t great.” None. Zero. A model call in production, returning text to users, with no automated way to know if the output was any good.

The excuse

The excuse was always the same. Paraphrased: “We can’t define what good looks like for this feature, so we’ll just ship and see.”

This is backwards. If you can’t define what good looks like, that is the feature most likely to regress. It is the feature most likely to silently degrade when you swap models, change a prompt, or update your retrieval pipeline. It is the feature where “ship and see” means “we’ll find out from our users, eventually, maybe.”

The features where “correct” is easy to define — extraction, classification, structured output — those are the ones that tend to hold up. You notice when they break because you have a schema and a test. The features where correctness is fuzzy — summarization, tone, helpfulness — those are the ones that rot.

Why fuzzy is not an excuse

Here is a thing that is true and that teams resist hearing: you do not need a perfect eval. You need a directional one.

Ten golden examples. Handwritten. Inputs you care about, outputs you’d be happy with. Run your system against them after every change. Did it get worse? Did it get better? You don’t need a score to three decimal places. You need a signal.

The bar is not “automated evaluation that captures every nuance of quality.” The bar is “better than nothing.” Nothing is what most teams had.

Consider what “nothing” actually looks like in practice. You change a prompt. You deploy. A PM notices a week later that summaries have gotten worse. They file a ticket. An engineer investigates. They can’t reproduce it because the inputs are different now. They tweak the prompt again. They deploy again. Nobody checks. The cycle repeats.

Now consider what “10 golden examples” looks like. You change a prompt. Your CI runs 10 examples. Three of them are clearly worse. You look at why. You fix the prompt. You deploy with confidence. Elapsed time: 20 minutes instead of 2 weeks.

The pattern we kept seeing

Among the teams we advised that quarter, there was a reliable pattern. Teams would build an LLM feature in a few days. They’d spend a week on prompt engineering — getting the output to feel right. They’d ship it. Then they’d never touch the prompt again, because touching it meant risking a regression they couldn’t measure.

The prompt became frozen. Not because it was good, but because nobody had a way to tell if a change made it better or worse. The feature shipped at whatever quality level the first prompt achieved, and it stayed there.

This is the opposite of iteration. This is the opposite of what software engineering is supposed to be. We’ve spent decades building the infrastructure to change code safely — tests, CI, staging environments, feature flags. Then we put a model call in the middle and throw all of it away.

What a minimal eval looks like

You don’t need a framework. You don’t need an evaluation platform. You need a script and a file.

The file is your golden set. Each example has an input and one or more reference outputs. The reference outputs don’t need to be perfect — they need to be “acceptable.” You’re not testing for exact match. You’re testing for direction.
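
For concreteness, here is one shape the file could take: JSONL, one example per line. The field names and the summarization slant are illustrative, not a required schema.

{"id": "pricing-page", "input": "<full text of the pricing page>", "reference": "Two sentences that name both paid plans and the free tier."}
{"id": "long-changelog", "input": "<a 3,000-word changelog>", "reference": "A bulleted summary that keeps every breaking change and drops the rest."}
{"id": "empty-doc", "input": "", "reference": "A polite note that there is nothing to summarize."}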

The script runs your system against the golden set and produces a report. The report can be as simple as “here are the outputs, diff them against last run.” For teams that want a number, use a model-as-judge pattern — have a second model rate the outputs on a 1-5 scale against criteria you define. It’s not perfect. It doesn’t need to be.
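
Here is a minimal sketch of such a script in Python, reading the JSONL format above. Every name in it is an assumption about your setup: generate is whatever wrapper you already have around your prompt and model call, the judge is optional and shown with the openai client purely as an illustration, and the runs/ directory is just one way of keeping the last run around to diff against.

# run_evals.py — a directional eval runner. A sketch, not a framework:
# every filename and function name here is an assumption about your setup.
import json
import pathlib
from datetime import datetime, timezone


def load_golden(path="golden_set.jsonl"):
    lines = pathlib.Path(path).read_text().splitlines()
    return [json.loads(line) for line in lines if line.strip()]


def example_judge(input_text, reference, output):
    # One possible model-as-judge: a second model rates the output 1-5.
    # Shown with the openai client as an illustration; swap in your own
    # provider and your own criteria.
    from openai import OpenAI
    prompt = (
        "Rate the candidate output from 1 (unusable) to 5 (as good as the reference).\n\n"
        f"Input:\n{input_text}\n\nReference:\n{reference}\n\nCandidate:\n{output}\n\n"
        "Reply with only the number."
    )
    reply = OpenAI().chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}]
    )
    digits = [c for c in reply.choices[0].message.content if c.isdigit()]
    return int(digits[0]) if digits else 0


def run_eval(generate, judge=None, golden_path="golden_set.jsonl", out_dir="runs"):
    # generate: your feature, a callable that takes the input text and returns
    # the model's output. judge is optional; without it you just get a diff.
    examples = load_golden(golden_path)
    out_dir = pathlib.Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    rows = []
    for ex in examples:
        output = generate(ex["input"])
        row = {"id": ex["id"], "output": output}
        if judge is not None:
            row["score"] = judge(ex["input"], ex["reference"], output)
        rows.append(row)

    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    (out_dir / f"{stamp}.jsonl").write_text("\n".join(json.dumps(r) for r in rows))

    # The report is deliberately dumb: compare against the previous run and
    # let a human eyeball anything that changed.
    runs = sorted(out_dir.glob("*.jsonl"))
    previous = {}
    if len(runs) > 1:
        previous = {r["id"]: r for r in map(json.loads, runs[-2].read_text().splitlines())}
    for row in rows:
        old = previous.get(row["id"], {})
        if not old:
            flag = "first run"
        elif old.get("output") != row["output"]:
            flag = "CHANGED"
        else:
            flag = "same"
        print(f"{row['id']}: {flag}  score={row.get('score', '-')} (was {old.get('score', '-')})")
    return rows

Keeping both the system under test and the judge as plain callables is deliberate: swapping a prompt, a model, or the judging criteria is one argument, not a new harness.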

Here’s the part that matters: the golden set is curated by a human who understands the feature. Not generated. Not sampled randomly from production. Handpicked. The 10 inputs that you’d be most embarrassed to get wrong. The edge cases you thought about during design. The examples your PM showed in the demo.

Those 10 examples are worth more than a thousand random ones, because they encode your taste. They represent your opinion about what good looks like — and having an opinion, even an imperfect one, is infinitely better than having no opinion at all.

The cost of skipping

We saw three flavors of pain from teams that skipped evals.

The silent regression. A model provider updates their API. The output format shifts subtly. Nobody notices for three weeks. Customer complaints trickle in. By the time someone investigates, there’s no baseline to compare against.

The frozen prompt. As described above. The team wants to improve the feature but can’t, because any change is a leap of faith. The feature ships at v1 quality and stays there.

The model migration tax. Team wants to switch from GPT-4 to Claude 3 (or vice versa) for cost or latency reasons. Without evals, the migration is a full manual QA cycle. With 10 golden examples, it’s a 5-minute script run. The teams without evals either don’t migrate — leaving money on the table — or migrate blind and pray.
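
Under the same assumptions as the sketch above (saved as run_evals.py, with the golden set from earlier), the migration check is just two runs over the same golden set, one per backend, compared side by side. The wrapper functions here are placeholders for however you actually call each provider.

# compare_models.py — the migration check: same golden set, two backends.
from run_evals import example_judge, run_eval


def current_system(text):
    # Placeholder: your existing prompt plus the incumbent model call.
    return text[:200]


def candidate_system(text):
    # Placeholder: the same prompt pointed at the model you might migrate to.
    return text[:200]


incumbent = run_eval(current_system, judge=example_judge, out_dir="runs/incumbent")
candidate = run_eval(candidate_system, judge=example_judge, out_dir="runs/candidate")

for old, new in zip(incumbent, candidate):
    print(f"{old['id']}: incumbent {old['score']} vs candidate {new['score']}")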

The meta-point

The difficulty of defining “correct” is not a reason to skip evals. It is the reason you need them.

Easy-to-eval features are easy to catch when they break. Hard-to-eval features are the ones that break silently, regress slowly, and create the kind of quality debt that compounds until someone notices you’re shipping garbage.

The harder it is to define what good looks like, the more valuable even a rough approximation becomes. A noisy signal is better than no signal. A biased eval is better than no eval. Ten examples are better than zero.

The heuristic

Before you ship an LLM feature, write down 10 inputs and 10 outputs you’d be happy with. Run your system against them. Save the results. Run them again after every change. That’s it. That’s the eval.

If you can’t write down 10 examples, you don’t understand the feature well enough to ship it.

tl;dr

The pattern. Teams skip evals on exactly the features where “correct” is hardest to define — summarization, tone, helpfulness — and those are precisely the features that silently rot in production.

The fix. Before you ship any LLM feature, write 10 representative input/output pairs, run your system against them after every change, and treat any regression as a blocker.

The outcome. What once took two weeks of PM complaints and guesswork to diagnose takes 20 minutes, and your prompts can finally evolve instead of staying frozen at v1 quality forever.
