Structured outputs don't fix structured thinking
JSON mode and function calling are great. But if the model doesn’t understand what you’re asking it to extract, you just get well-formatted garbage. We see this pattern constantly — teams ship structured outputs and assume the quality problem is solved.
The formatting problem is gone
For most of 2023, a significant chunk of LLM engineering was string parsing. You would ask the model to return JSON. Sometimes it did. Sometimes it wrapped it in markdown code fences. Sometimes it added a preamble. Sometimes the JSON was almost valid — a trailing comma here, a missing quote there.
Teams wrote fragile regex parsers. They retried on parse failures. They added “IMPORTANT: Return ONLY valid JSON” to their prompts in increasingly emphatic capitals.
Then structured outputs arrived — JSON mode, function calling, tool use with enforced schemas. The formatting problem vanished overnight. You define a schema, the model fills it in, the output parses every time. This was a genuine infrastructure win.
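To make the distinction concrete, here is a minimal sketch of what schema enforcement buys you. The schema and field names are illustrative, not tied to any particular provider's API, and the "model response" is hard-coded rather than fetched from a real model:

```python
import json

# Illustrative JSON-Schema-style definition for contract extraction.
# Field names are hypothetical examples, not a real provider's format.
CONTRACT_SCHEMA = {
    "type": "object",
    "required": ["party_names", "effective_date",
                 "termination_clause", "governing_law"],
}

def conforms(obj: dict, schema: dict) -> bool:
    """Minimal structural check: every required key is present.
    With enforced schemas this passes every time -- which is exactly
    why it says nothing about whether the values are right."""
    return all(key in obj for key in schema["required"])

# A simulated model response that parses cleanly but could still
# contain the wrong clause -- the schema cannot tell the difference.
response = json.loads(
    '{"party_names": ["Acme Corp", "Globex LLC"],'
    ' "effective_date": "2024-03-01",'
    ' "termination_clause": "Either party may terminate on 30 days notice.",'
    ' "governing_law": "Delaware"}'
)

assert conforms(response, CONTRACT_SCHEMA)  # format compliance: guaranteed
```

The assertion at the end always holds once schemas are enforced. Nothing in it inspects whether the termination clause is the right one.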
But it solved the wrong problem.
The thinking problem remains
Here is an extraction task we see regularly. A team wants to pull structured data from contracts — party names, effective dates, termination clauses, governing law. They define a JSON schema. They pass the contract to the model. They get back a perfectly formatted JSON object.
The party names are right 95% of the time. The effective date is right 90% of the time. The termination clause is right 70% of the time. The governing law is right 60% of the time.
Before structured outputs, the termination clause was wrong 30% of the time and the JSON was broken 15% of the time. Now the JSON is never broken and the termination clause is still wrong 30% of the time.
The formatting fix masked the extraction quality problem. The team’s error rate dropped — because parse failures went away — but the semantic accuracy did not change. They shipped it. Users started trusting the output because it looked clean and professional. Well-formatted JSON feels more reliable than a messy text blob, even when the content is identical.
This is the danger. Structured outputs increase trust without increasing accuracy.
Why the model gets it wrong
The model fails on extraction for the same reasons it always has. The information is ambiguous. The document uses domain-specific language the model has not seen enough of. The relevant clause is buried in a 40-page document and the model’s attention gets diluted. The schema asks for a field that requires inference, not extraction — “Is this contract auto-renewing?” is a judgment call, not a lookup.
None of these problems are formatting problems. Putting the answer in a JSON field does not make the model think harder about it. The model produces the same internal representation whether it outputs free text or structured JSON. The structured output layer is downstream of the thinking. It is a serialization step.
Think of it this way: if you ask someone who does not understand contracts to fill in a form about a contract, the form will be neatly filled in and mostly wrong. Giving them a better form does not help. Teaching them about contracts helps.
What actually improves extraction quality
Better prompts. This is unsexy but true. A prompt that explains what a termination clause is, what forms it can take, and what to do when it is ambiguous will outperform a terse prompt with a perfect schema every time. The schema tells the model what shape to produce. The prompt tells it what to think about.
Few-shot examples. Show the model 3-5 examples of inputs and correct outputs. Not synthetic examples — real ones, from your actual corpus, including the tricky cases. Few-shot examples communicate expectations more precisely than instructions. They show the model what “right” looks like in your domain.
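Mechanically, few-shot prompting is just interleaving real input/output pairs before the new document. A sketch, with a hypothetical helper name and chat-style message dicts as the assumed interface:

```python
def build_extraction_messages(examples, document, system_prompt):
    """Assemble a chat-style message list with few-shot examples.
    `examples` is a list of (input_text, correct_output_json) pairs
    drawn from a real corpus -- including the tricky cases."""
    messages = [{"role": "system", "content": system_prompt}]
    for example_doc, correct_answer in examples:
        # Each example is a user turn followed by the known-good
        # assistant turn, so the model sees what "right" looks like.
        messages.append({"role": "user", "content": example_doc})
        messages.append({"role": "assistant", "content": correct_answer})
    messages.append({"role": "user", "content": document})
    return messages
```

The examples go in as prior conversation turns rather than inline prose, which most chat APIs treat as stronger evidence of the expected behavior than instructions alone.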
Domain-specific validation. A well-structured output can be validated beyond “is this valid JSON.” Is the effective date in the future? Is the governing law a real jurisdiction? Is the extracted dollar amount within a plausible range? These checks catch errors that the model will make regardless of output format.
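As a sketch of what such checks look like in practice (field names, the jurisdiction allowlist, and the plausible ranges are all illustrative assumptions):

```python
from datetime import date

# Deliberately truncated allowlist for illustration only.
KNOWN_JURISDICTIONS = {"Delaware", "New York", "California", "Texas"}

def validate_extraction(fields: dict) -> list:
    """Return semantic problems that 'is this valid JSON' can never
    catch. An empty list means the extraction passed these checks --
    not that it is correct."""
    problems = []
    effective = date.fromisoformat(fields["effective_date"])
    if not (1990 <= effective.year <= 2100):
        problems.append("effective_date outside plausible range")
    if fields["governing_law"] not in KNOWN_JURISDICTIONS:
        problems.append("governing_law is not a recognized jurisdiction")
    amount = fields.get("contract_value_usd")
    if amount is not None and not (0 < amount < 1_000_000_000):
        problems.append("contract_value_usd outside plausible range")
    return problems
```

Each check encodes a fact about the domain, not about JSON. That is why they catch errors that schema enforcement cannot.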
Decomposition. Instead of asking the model to extract 12 fields from a 40-page document in one pass, break it into steps. First, find the relevant section. Then extract from that section. This reduces the attention problem and gives you a chance to validate intermediate results.
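The first pass can be as simple as narrowing the document to a window around the relevant section before anything goes to the model. A real system would likely use retrieval or a cheap model call for this step; the keyword scan below is a deliberately simple stand-in:

```python
def find_relevant_section(document: str, keywords: list, window: int = 500) -> str:
    """Pass 1: narrow a long document to the span around the first
    keyword hit, so pass 2 extracts from a focused excerpt instead
    of 40 pages. Falls back to the full text if nothing matches."""
    lowered = document.lower()
    for keyword in keywords:
        idx = lowered.find(keyword.lower())
        if idx != -1:
            return document[max(0, idx - window): idx + window]
    return document

# Simulated long contract with the clause buried in the middle.
contract = "boilerplate " * 500 + \
    "Termination: either party may terminate with 60 days notice. " + \
    "boilerplate " * 500
excerpt = find_relevant_section(contract, ["termination"])
```

Pass 2 then sends only `excerpt` to the model, and you can validate the intermediate result, for example by checking that the excerpt actually mentions termination before extracting from it.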
Confidence calibration. Ask the model to rate its confidence on each field. This is not perfectly calibrated — models are notoriously overconfident — but it correlates with accuracy well enough to be useful. Flag low-confidence extractions for human review. This turns the system from “fully automated” to “automated with targeted human oversight,” which is almost always the right design for high-stakes extraction.
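The routing logic itself is trivial once the model reports per-field confidence. A sketch, assuming a hypothetical shape where each field maps to a (value, confidence) pair:

```python
def fields_for_review(extraction: dict, threshold: float = 0.8) -> list:
    """Return the names of fields whose model-reported confidence
    falls below the threshold, so they can be routed to a human
    instead of trusted automatically."""
    return [name for name, (_, confidence) in extraction.items()
            if confidence < threshold]

# Simulated extraction mirroring the accuracy pattern in the article:
# easy fields come back confident, judgment calls do not.
extraction = {
    "party_names": (["Acme Corp", "Globex LLC"], 0.97),
    "effective_date": ("2024-03-01", 0.91),
    "termination_clause": ("30 days written notice", 0.62),
    "governing_law": ("Delaware", 0.55),
}
```

With a 0.8 threshold, this flags the termination clause and governing law for review, which matches where the accuracy problems actually were.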
The organizational pattern
Here is the pattern we see play out. A team adopts structured outputs. Their parse-error rate drops to zero. They report a quality improvement to leadership. Leadership approves scaling the system to more document types. The team scales it. Accuracy on the new document types is poor — but since the output is always valid JSON, the failures are silent. They surface weeks later when a downstream consumer notices bad data.
The root cause is that the team measured format compliance and called it quality. These are different things. Format compliance is a necessary condition for a usable system. It is not a sufficient condition for a correct one.
The takeaway
Structured outputs are a serialization layer. They guarantee that the model’s answer fits your schema. They do not guarantee that the answer is right.
The heuristic: after you adopt structured outputs, your error rate on formatting should drop to zero. If your overall error rate drops by the same amount, you had a formatting problem. If it doesn’t, you have a thinking problem — and you need to solve it with better prompts, better examples, and better validation.
tl;dr
The pattern. Teams adopt JSON mode or function calling, watch their parse-error rate drop to zero, report a quality improvement to leadership, and scale a system where the outputs are well-formatted but the extractions are still wrong 30% of the time.

The fix. Treat structured outputs as the serialization layer they are, then separately improve extraction quality with better prompts, few-shot examples from your actual corpus, domain-specific validation, and decomposition of complex documents into targeted steps.

The outcome. You stop conflating format compliance with correctness, silent failures surface before they reach downstream consumers, and the system earns the trust that clean JSON was giving it for free.