Benchmarks are vanity metrics
MMLU, HellaSwag, HumanEval — these tell you which model wins a standardized test. They do not tell you which model works for your use case. Build your own benchmark or fly blind.
Every model release in mid-2024 came with a table. Rows of benchmark names — MMLU, HellaSwag, HumanEval, GSM8K, ARC-Challenge. Columns of numbers. The new model’s numbers were higher than the previous model’s numbers, or higher than the competitor’s numbers, or — if neither of those was true — higher on a carefully selected subset of benchmarks.
The tables were impressive. They were also almost entirely useless for making production decisions.
What benchmarks measure
Public benchmarks measure general capability on standardized tasks. MMLU tests broad knowledge across academic subjects. HumanEval tests code generation on isolated programming problems. HellaSwag tests commonsense reasoning in sentence completion. GSM8K tests grade-school math.
These are real capabilities. They correlate, loosely, with general model quality. A model that scores poorly on all of them is probably not a good model. A model that scores well on all of them is probably a decent model.
But “probably a decent model” is not a production decision. A production decision is: which model, at which price point, at which latency, performs best on my specific task?
And for that question, public benchmarks tell you almost nothing.
The gap between general and specific
Here is a thing we have measured directly, across multiple client engagements: two models with a 2-point difference on MMLU can have a 20-point difference on a task-specific eval.
This is not an exaggeration. It’s not an edge case. It’s the norm.
A model that is 3% better at broad academic knowledge can be 30% worse at extracting line items from invoices. A model that scores higher on code generation benchmarks can be worse at generating code in your specific framework, with your specific conventions, against your specific APIs.
The reason is straightforward. Public benchmarks are averages across broad categories. Your use case is a specific point in a vast capability space. The average tells you very little about the specific point.
Consider what it would mean in other domains. You wouldn’t pick a database by looking at TPC-C benchmarks alone. You’d run your workload on the candidates and measure. You wouldn’t pick a frontend framework by looking at synthetic render benchmarks. You’d prototype with your actual components and measure. Model selection should work the same way.
The benchmark culture
Mid-2024 had a distinctive culture around model releases. A new model would drop. Twitter would erupt with benchmark comparisons. Hot takes would fly about which model was “better.” Teams would start migration discussions based on the benchmark table in the announcement blog post.
This is backwards. The benchmark table is marketing material. It is not evaluation. The model provider chose which benchmarks to highlight. They tuned for those benchmarks. They cherry-picked the comparison points. This is not nefarious — it’s what every company does with every product launch. But treating marketing material as engineering data is a mistake.
The more subtle problem: benchmark numbers create a false sense of precision. “Model A scores 87.3 on MMLU, Model B scores 85.1.” That 2.2-point difference feels meaningful. It is not meaningful for your production use case. The confidence interval on “how well will this model perform on my specific task” is vastly wider than 2.2 points.
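A back-of-envelope calculation makes the point concrete. The sketch below uses the standard normal approximation for a binomial proportion; the specific pass rate and sample size are illustrative, not from the text. Even measuring your task *directly* with a 50-example eval leaves roughly a ±10-point confidence interval, and extrapolating from a generic benchmark is far noisier than that.

```python
import math

def pass_rate_ci(p: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation 95% confidence interval for a measured pass rate.

    p: observed pass rate, n: number of eval examples.
    """
    half_width = z * math.sqrt(p * (1 - p) / n)
    return (p - half_width, p + half_width)

# Illustrative: a model passes 85% of a 50-example task eval.
lo, hi = pass_rate_ci(0.85, 50)
print(f"85% over 50 examples -> 95% CI roughly {lo:.0%} to {hi:.0%}")
# prints: 85% over 50 examples -> 95% CI roughly 75% to 95%
```

A ten-percentage-point half-width on a direct measurement dwarfs a 2.2-point gap on a benchmark that was never measuring your task in the first place.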
Why your task is special
Every team thinks their use case is standard. “We’re just doing summarization.” “It’s basic classification.” “We’re just extracting entities.”
Your use case is not standard. Your summarization has specific length requirements, a specific tone, specific things that must be included, and specific things that must be left out. Your classification has categories that overlap in domain-specific ways. Your entity extraction deals with formats and edge cases that no benchmark covers.
The gap between the generic task and your specific implementation of that task is where model performance varies wildly. And it’s the gap that public benchmarks don’t measure, because they can’t — they’re generic by definition.
How to build a task-specific eval
You don’t need hundreds of examples. You need 50 good ones.
Start with production data. Pull real inputs from your system — or realistic synthetic ones if you’re pre-launch. Don’t invent examples from scratch. Real data has real messiness, and that messiness is where models differ.
Label the outputs yourself. Have a domain expert — someone who understands what good looks like — review model outputs and rate them. A simple 1-5 scale works. “Would you be comfortable showing this to a user?” works even better.
Cover the edges. Don’t just test the happy path. Include the inputs that are ambiguous, malformed, adversarial, or just weird. These are where models diverge most.
Automate the run. Write a script that sends your 50 examples to each candidate model, collects the outputs, and presents them for review. This takes a few hours the first time. After that, it takes minutes.
Track over time. Every time a new model drops, run your eval. The number you care about is not the public benchmark — it’s your benchmark. “Model A scores 4.2/5 on our task, Model B scores 3.8/5 on our task.” That’s a production decision.
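The whole pipeline above fits in a few dozen lines. This is a minimal sketch, not a prescribed implementation: `call_model` is a placeholder for whatever provider client you actually use, and the CSV layout and field names are assumptions for illustration. The shape is what matters: run every example through every candidate, collect outputs for a human reviewer, then compute a per-model mean score.

```python
import csv
import statistics

def call_model(model: str, prompt: str) -> str:
    """Placeholder: swap in your provider's client library here."""
    raise NotImplementedError

def run_eval(models: list[str], examples: list[dict], path: str = "eval_outputs.csv") -> None:
    """Send every example to every candidate model; write outputs for human review."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        # The reviewer fills in the 1-5 "score" column by hand.
        writer.writerow(["model", "input", "output", "score"])
        for model in models:
            for example in examples:
                output = call_model(model, example["input"])
                writer.writerow([model, example["input"], output, ""])

def summarize(reviewed_rows: list[dict]) -> dict[str, float]:
    """After review: mean score per model -- the number you actually decide on."""
    scores_by_model: dict[str, list[float]] = {}
    for row in reviewed_rows:
        scores_by_model.setdefault(row["model"], []).append(float(row["score"]))
    return {model: statistics.mean(scores) for model, scores in scores_by_model.items()}
```

Rerunning it when a new model drops is a one-line change to the `models` list, which is what makes the "track over time" step cheap.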
The meta-benchmark problem
There’s a subtler issue with public benchmarks: they become targets. Once a benchmark is widely used, model providers optimize for it. Not through outright data contamination — though that happens — but through training emphasis. If MMLU is the benchmark everyone watches, you allocate more training compute to the kinds of knowledge MMLU tests.
This is Goodhart’s Law applied to AI: when a measure becomes a target, it ceases to be a good measure. The benchmark scores go up, but the improvement doesn’t transfer uniformly to all tasks. It transfers most to tasks similar to the benchmark and least to tasks that are different.
Your production task is, almost certainly, different from any public benchmark. Which means the benchmark improvement you see in the release blog post overstates the improvement you’ll see in practice.
When benchmarks are useful
Benchmarks are not useless. They’re useful for two things.
Filtering. If a model scores poorly across all major benchmarks, you can probably skip it. Benchmarks are a reasonable lower bound on capability. They’re just not a useful upper bound on task-specific performance.
Tracking trends. Watching benchmark scores over time — across model families and providers — tells you how fast the field is moving and which capabilities are improving fastest. This is useful for strategic planning. It is not useful for model selection.
For everything else — for the actual decision of which model to deploy in production — you need your own eval.
The heuristic
Never make a model selection decision based on public benchmarks alone. Build a task-specific eval with 50 examples from your domain. Run every candidate model against it. Use the results — not the benchmark table — to decide.
If you don’t have time to build a 50-example eval, build a 10-example eval. If you don’t have time for 10, something is wrong with your priorities: you’re about to put a model in front of users, and you can’t spare an afternoon to check whether it’s good at the thing you’re using it for.
The model that wins on MMLU might lose on your task. The model that loses on HumanEval might be the best at your specific code generation problem. You will not know until you measure. Measure.
tl;dr
The pattern. Teams use public benchmark scores — MMLU, HumanEval, GSM8K — to make model selection decisions, not realizing that a 2-point gap on a generic benchmark can flip to a 20-point gap in the other direction on their specific task.
The fix. Build a 50-example task-specific eval using real inputs from your domain, have a domain expert label the outputs, and run every candidate model against it before making any production decision.
The outcome. Model selection becomes a measurement exercise rather than a marketing exercise, and you stop discovering mid-migration that the “better” model is actually worse for the thing you’re using it for.