№ 20 scope Jan 24, 2025 · 9 min read

The 90% accuracy problem

90% accuracy means 1 in 10 answers is wrong. Whether that is acceptable depends entirely on what happens when the wrong answer ships.


“We’re at 90% accuracy.” We hear this in almost every initial call. It is delivered like good news. Sometimes it is. Usually, we do not have enough information to know — and neither does the team saying it.

90% accuracy means 1 in 10 answers is wrong. Whether that is a rounding error or a crisis depends entirely on one question that almost nobody asks: what happens when the wrong answer ships?

The missing variable

Accuracy is not a quality metric. It is half of a quality metric. The other half is the cost of being wrong.

Consider two systems, both at 90% accuracy:

System A recommends blog posts to readers. When it gets it wrong, the reader sees an irrelevant article. They scroll past it. Nobody notices. Nobody cares. 90% accuracy is fine. 80% might be fine too.

System B answers patient questions about drug interactions. When it gets it wrong, a patient might take two medications that should not be combined. 90% accuracy means roughly 1 in 10 patients gets bad information. That is not a product issue. That is a liability issue.

Same accuracy number. Entirely different risk profiles. The number alone tells you nothing.

Why teams get stuck on a single number

There is a natural pull toward a single accuracy metric. It is easy to track. It goes on a dashboard. You can set a target and measure progress. Product managers love it. Executives love it more.

The problem is that a single number averages across all your failure modes. It treats a harmless miss the same as a dangerous one. It hides the distribution of errors behind a mean.

We audited a customer support system last year. Overall accuracy was 92%. Very respectable. But when we broke it down by category, the picture changed. For simple FAQ questions — “what are your hours,” “how do I reset my password” — accuracy was 98%. For billing disputes — “why was I charged twice,” “I want a refund” — accuracy was 71%.

The 92% number was masking the fact that the hardest, highest-stakes questions were the ones the system handled worst. Which makes sense — hard questions are hard. But the team was not tracking this. They saw 92% and moved on to other priorities.
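Breaking an aggregate score down by category is a few lines of code once the eval results carry a category tag. A sketch, with illustrative results rather than the audit data:

```python
from collections import defaultdict

def accuracy_by_category(results):
    """results: iterable of (category, is_correct) pairs from a labeled eval set."""
    totals, correct = defaultdict(int), defaultdict(int)
    for category, ok in results:
        totals[category] += 1
        correct[category] += int(ok)
    return {cat: correct[cat] / totals[cat] for cat in totals}

# Illustrative, not the audit data: FAQ questions all correct,
# billing disputes only half correct.
results = ([("faq", True)] * 4
           + [("billing", True), ("billing", True),
              ("billing", False), ("billing", False)])
per_cat = accuracy_by_category(results)
```

In this toy run the aggregate is 75%, while billing sits at 50% — the same masking effect as the 92%-hiding-71% audit, at smaller scale.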

The failure mode framework

Here is how we think about it. Classify every failure mode into one of four categories:

Harmless. The user notices the error but it has no consequence. A recommendation engine suggesting a mediocre article. A search system returning a slightly suboptimal result. The user shrugs and moves on.

Embarrassing. The error is visible and reflects poorly on the product, but causes no material harm. A chatbot giving a confidently wrong answer about your company’s founding date. A summarizer producing an awkwardly worded sentence. Trust erodes slowly.

Costly. The error has a direct financial or operational consequence. A pricing system that miscalculates a quote. A routing system that sends a high-value ticket to the wrong team. An extraction pipeline that pulls the wrong dollar amount from a contract.

Dangerous. The error creates legal, safety, or regulatory risk. Medical advice. Legal interpretation. Financial compliance. Anything where being wrong can hurt someone.

Once you have this classification, set accuracy thresholds per category — not for the system as a whole. 85% accuracy on harmless failures might be perfectly fine. 85% accuracy on dangerous failures is almost certainly not.
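Per-category thresholds can act as a release gate. A sketch — the threshold numbers here are illustrative assumptions, not recommendations:

```python
# Illustrative thresholds; the real numbers come out of the threshold
# conversation, not out of the ML team alone.
THRESHOLDS = {"harmless": 0.85, "embarrassing": 0.90,
              "costly": 0.97, "dangerous": 0.995}

def release_blockers(per_category_accuracy):
    """Categories whose measured accuracy falls below their threshold.
    Categories with no agreed threshold get an unreachable one (1.01),
    so an unclassified failure mode always blocks the release."""
    return [cat for cat, acc in per_category_accuracy.items()
            if acc < THRESHOLDS.get(cat, 1.01)]
```

Note the asymmetry the text calls for: 88% passes the harmless gate and fails the dangerous one.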

The threshold conversation

This is where it gets uncomfortable. Setting per-category thresholds forces you to have conversations that most teams would rather avoid.

“What accuracy do we need on billing questions before we let the AI handle them without human review?” That is a real question with real consequences. It requires input from legal, from customer success, from finance. It cannot be answered by the ML team alone.

Most teams skip this conversation. They ship with a single accuracy number and a vague sense that it is “good enough.” Then an edge case blows up, and they scramble.

The teams that do this well have a simple artifact — a table. Rows are failure categories. Columns are: current accuracy, target accuracy, what happens when it is wrong, and who approved the threshold. It is not a sophisticated document. It fits on one page. But it forces the conversation that needs to happen before you ship.
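That one-page table can live next to the code. A sketch of the same four columns as a plain data structure — every row, number, and owner below is hypothetical:

```python
# Hypothetical rows — the point is the shape of the artifact, not the values.
# Columns: category, current accuracy, target accuracy,
#          what happens when it is wrong, who approved the threshold.
RISK_TABLE = [
    ("harmless",  0.91, 0.85, "reader sees an irrelevant article",       "PM"),
    ("costly",    0.71, 0.97, "refund applied to the wrong invoice",     "Finance"),
    ("dangerous", 0.88, None, "human review required, never autonomous", "Legal"),
]

def below_target(table):
    """Rows still short of their approved threshold.
    A target of None means no autonomous operation is approved at all."""
    return [row[0] for row in table
            if row[2] is not None and row[1] < row[2]]
```

The function is almost beside the point; the value is that every row names a consequence and an approver before anything ships.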

Measuring per-category accuracy

This requires labeled data that is tagged by category. Which means your eval set needs to be stratified, not just large.

A common mistake: teams build a 500-example eval set, sample randomly, and measure aggregate accuracy. The result is a number that over-represents common, easy cases and under-represents rare, hard ones. You end up with high accuracy on the things that did not need an AI system in the first place.

A better approach: build your eval set category by category. Ensure you have at least 50 examples per failure category — more for the dangerous ones. Measure accuracy within each category independently. Report the per-category numbers alongside the aggregate.
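The category-by-category build can be sketched as a stratified sampler — fixed quotas per category instead of one random draw from the pool. The quota numbers are illustrative:

```python
import random

def build_stratified_eval_set(pool, per_category, seed=0):
    """pool: list of (category, example) pairs.
    Sample a fixed count per category rather than sampling the pool at
    random, so rare, hard categories are not drowned out by easy ones."""
    rng = random.Random(seed)
    by_cat = {}
    for cat, example in pool:
        by_cat.setdefault(cat, []).append(example)
    eval_set = {}
    for cat, n in per_category.items():
        available = by_cat.get(cat, [])
        if len(available) < n:
            raise ValueError(f"need {n} examples for {cat!r}, have {len(available)}")
        eval_set[cat] = rng.sample(available, n)
    return eval_set

# Illustrative quotas: the floor of 50 everywhere, more for dangerous.
pool = [("harmless", i) for i in range(500)] + [("dangerous", i) for i in range(120)]
eval_set = build_stratified_eval_set(pool, {"harmless": 50, "dangerous": 100})
```

The `ValueError` is deliberate: if you cannot find 50 examples of a dangerous failure mode, that is a data-collection problem to surface, not to paper over.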

Yes, this is more work. It is dramatically less work than dealing with a production incident in a high-stakes category you were not measuring.

The human review escape hatch

For the dangerous categories, the answer is often not “improve accuracy.” The answer is “do not let the AI answer without human review.”

This is not a failure of the AI system. This is a design decision. A well-designed system knows its own limitations and routes accordingly. The AI handles the harmless and embarrassing categories autonomously. The costly and dangerous categories get flagged for human review.

The accuracy requirement for the routing itself is high — you need the system to correctly identify which category a query falls into. But that is a classification problem, and classification problems are much more tractable than open-ended generation problems.
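The routing decision itself can be a few lines once the category classifier exists. A sketch of the escalation logic — the category names follow the framework above; everything else is a hypothetical design:

```python
# Hypothetical routing layer. The model only has to classify the query's
# failure category — a classification problem — not answer the hard query.
AUTONOMOUS = {"harmless", "embarrassing"}

def route(predicted_category: str) -> str:
    """Anything not explicitly safe, including categories the classifier
    has never seen, goes to a human by default."""
    return "ai" if predicted_category in AUTONOMOUS else "human_review"
```

The default matters: failing closed (unknown → human) is what makes a week of routing work safer than months of chasing accuracy in the hardest category.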

We have seen teams spend months trying to improve accuracy on their hardest category from 85% to 95%. They would have been better served spending a week building a routing layer that sends those queries to a human. The accuracy improvement was not realistic on their timeline. The routing layer was.

The number you actually need

Here is the uncomfortable truth: there is no universally “good” accuracy number. There is only the number that is appropriate for your specific failure modes, your specific users, and your specific risk tolerance.

The heuristic: before you report an accuracy number, you should be able to answer “what happens when this is wrong?” for every category of error. If you cannot answer that question, the accuracy number is meaningless — you do not know what you are measuring against.

A system at 85% accuracy with well-understood, well-classified failure modes and appropriate human escalation paths is safer than a system at 95% accuracy where nobody has thought about what happens in the remaining 5%.

Measure the cost of being wrong. Then decide how often you can afford it. Then set the threshold. That is the order.

tl;dr

The pattern. Teams report a single aggregate accuracy number that masks the distribution of failures, so a 92% overall score hides 71% accuracy on the billing questions that matter most.

The fix. Classify every failure mode as harmless, embarrassing, costly, or dangerous, build a stratified eval set with at least 50 examples per category, and set separate accuracy thresholds — not a single aggregate — for each one.

The outcome. The dangerous categories get routed to humans before they cause liability events, and the team stops spending months chasing accuracy improvements that the architecture, not the model, needs to solve.

