№ 18 ops Dec 20, 2024 · 9 min read

Model migrations are database migrations

Switching models is not swapping an API key. It changes your outputs, your latency, your costs, and your eval results. Treat it with the same rigor as a database migration.


A team we were advising switched from GPT-4 to GPT-4o on a Friday afternoon. Changed the model string in their config, deployed, went home for the weekend. By Monday they had 40 support tickets. The outputs were different — slightly different phrasing, different formatting, different handling of edge cases. Their downstream parsing code broke on 15% of responses. Their eval scores dropped 8 points. Their latency improved, which was nice, but nobody noticed because they were too busy triaging the regressions.

This was not a negligent team. They were experienced engineers who understood that model changes have consequences. They just underestimated how many consequences, and they treated the change like a config update instead of a migration.

What changes when you change a model

A model is not a library with a stable API. It is a non-deterministic function: the same input can produce different outputs, and a new model shifts the entire output distribution. When you change the function, everything downstream of it changes too.

Outputs. This is the obvious one, and teams still underestimate it. Different models produce different text for the same prompt. The differences are often subtle — a word choice here, a formatting choice there. But if you have code that parses model outputs — extracting JSON, splitting on delimiters, matching patterns — subtle differences break things. A model that returns {"answer": "yes"} and a model that returns {"answer": "Yes"} are functionally different if your code does an exact string match.
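The yes/Yes failure mode above is easy to reproduce. A minimal sketch (the field names and parsers here are illustrative, not from any specific codebase):

```python
import json

def parse_answer_strict(raw: str) -> bool:
    # Brittle: an exact string match breaks the moment the new
    # model capitalizes "Yes" or adds trailing whitespace.
    return json.loads(raw)["answer"] == "yes"

def parse_answer_tolerant(raw: str) -> bool:
    # Normalizing before comparing survives cosmetic output drift.
    return json.loads(raw)["answer"].strip().lower() == "yes"
```

The strict parser treats `{"answer": "Yes"}` as a "no". Nothing errors, nothing logs; the system just silently returns wrong results, which is why this class of bug surfaces as support tickets rather than alerts.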

Latency. Different models have different speed profiles. Switching from a larger model to a smaller one usually improves latency. Switching providers — say, from OpenAI to Anthropic — changes latency in unpredictable ways that depend on routing, server load, and context length. If you have SLAs or timeout settings tuned to your current model’s latency profile, a model change can violate them.

Cost. Pricing varies by model and by provider. A change that looks like a drop-in replacement might double your per-token cost, or halve it. If you are processing high volumes, this matters. If you have a budget that assumes a specific cost-per-query, a model change is a budget change.
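The budget math is worth doing on paper before the migration, not on the invoice after it. A back-of-the-envelope sketch (the volumes and per-million-token prices below are made up for illustration, not current list prices):

```python
def monthly_cost(queries_per_day: int, in_tokens: int, out_tokens: int,
                 in_price: float, out_price: float) -> float:
    """Estimated monthly spend; prices are USD per 1M tokens."""
    per_query = (in_tokens * in_price + out_tokens * out_price) / 1_000_000
    return per_query * queries_per_day * 30

# Hypothetical numbers: 50k queries/day, 1,500 input + 400 output tokens each.
old = monthly_cost(50_000, 1_500, 400, in_price=30.0, out_price=60.0)
new = monthly_cost(50_000, 1_500, 400, in_price=2.5, out_price=10.0)
```

With these illustrative prices the "drop-in replacement" is roughly a 9x cost reduction; with the prices reversed it would be a 9x increase. Either way, it is a budget change, and finance should hear about it before the deploy.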

Token limits and context windows. Models have different context windows. A prompt that fits in one model’s context might not fit in another’s. If your system dynamically constructs prompts — stuffing retrieved chunks into context — you need to verify that your prompts still fit. A prompt that silently gets truncated because it exceeds the context window produces wrong answers without raising an error.
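A cheap guard against silent truncation is to budget tokens explicitly when constructing the prompt. A minimal sketch, assuming you already have token counts for each retrieved chunk (how you count them depends on your tokenizer):

```python
def fits_context(prompt_tokens: int, max_output_tokens: int,
                 context_window: int) -> bool:
    # The prompt plus the reserved output budget must fit the window.
    return prompt_tokens + max_output_tokens <= context_window

def pack_chunks(chunk_token_counts: list[int], budget: int) -> list[int]:
    """Greedily include retrieved chunks (by index) until the budget is spent,
    instead of stuffing everything in and letting the API truncate."""
    packed, used = [], 0
    for i, n in enumerate(chunk_token_counts):
        if used + n > budget:
            break
        packed.append(i)
        used += n
    return packed
```

Run `fits_context` against the *new* model's window for a sample of real prompts before cutting over; a prompt that fit in a 128k window does not necessarily fit in an 8k one.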

Eval results. Your eval suite was built and calibrated against a specific model’s behavior. Your thresholds, your scoring rubrics, your golden set — all of these assume a particular output style. A new model might score differently on your eval even if the actual quality is equivalent. You need to re-baseline, not just re-run.

The database migration analogy

Software engineers learned decades ago that database schema changes are dangerous. A schema migration can break queries, corrupt data, and take down production. The industry developed a discipline for this: migration scripts, rollback plans, staged rollouts, shadow reads, canary deploys. Nobody changes a database schema on a Friday afternoon.

Model changes have the same risk profile. They change the shape of your system’s outputs. They can break downstream consumers. They require testing against production-like data. They need rollback plans.

The discipline should be the same.

The migration plan

Here is the process we recommend. It is not novel. It is the database migration playbook applied to models.

Step 1: Run the eval suite. Before deploying anything, run your full eval suite against the new model. Compare scores to your current model’s baseline. Look at the overall score, but also look at per-category breakdowns. A model might score the same overall but regress on a specific category that matters to your users.
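The per-category comparison can be a few lines of code. A sketch, assuming your eval harness can emit per-category scores as a dict (category names here are hypothetical):

```python
def regressions(baseline: dict[str, float], candidate: dict[str, float],
                tolerance: float = 0.02) -> dict[str, tuple[float, float]]:
    """Categories where the candidate model scores below the baseline
    by more than the noise tolerance. Missing categories count as 0."""
    return {
        cat: (baseline[cat], candidate.get(cat, 0.0))
        for cat in baseline
        if candidate.get(cat, 0.0) < baseline[cat] - tolerance
    }

report = regressions(
    {"extraction": 0.90, "formatting": 0.85},
    {"extraction": 0.91, "formatting": 0.75},
)
```

Here the overall average barely moves, but formatting regressed by 10 points, which is exactly the kind of category-level drop an aggregate score hides.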

Step 2: Compare outputs. Take a sample of 100–200 production queries. Run them through both models. Diff the outputs. Look for systematic differences — formatting changes, tone changes, refusal patterns, verbosity differences. This step often reveals issues that the eval suite misses because the eval is measuring accuracy and the issue is formatting.
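The diffing step needs no special tooling; the standard library is enough for a first pass:

```python
import difflib

def diff_outputs(old: str, new: str) -> list[str]:
    """Unified diff of two models' responses to the same prompt."""
    return list(difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile="old_model", tofile="new_model", lineterm="",
    ))

# Run this over 100-200 sampled queries and eyeball the diffs for
# systematic patterns: formatting, verbosity, refusals.
sample_diff = diff_outputs("Answer: yes", "Answer: Yes")
```

Reading a couple hundred diffs by hand sounds tedious; it is also where you notice that the new model, say, always wraps answers in markdown headers your old model never used.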

Step 3: Check the plumbing. If you have code that parses model outputs — JSON extraction, regex matching, structured output parsing — test it against the new model’s outputs. This is where most migrations break. The model is fine. The parsing code is not.
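A simple way to check the plumbing is to replay captured new-model outputs through your existing parsing code and collect everything that blows up. A minimal harness sketch (your real parser goes where `json.loads` is used below):

```python
import json

def check_parser(parse, samples: list[str]) -> list[tuple[str, str]]:
    """Run the parsing code over captured outputs from the new model;
    return (raw_output, error) pairs for every sample that fails."""
    failures = []
    for raw in samples:
        try:
            parse(raw)
        except Exception as exc:
            failures.append((raw, repr(exc)))
    return failures

# Illustrative: one well-formed output, one with the markdown fencing
# some models like to wrap JSON in.
failures = check_parser(json.loads, ['{"answer": 1}', '```json\n{"answer": 1}\n```'])
```

A 15% failure rate on this harness, run before deploy, is a Tuesday afternoon fix. The same failure rate discovered in production is 40 support tickets.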

Step 4: Shadow test. Deploy the new model alongside the old one in production. Send real traffic to both. Log the new model’s responses but serve the old model’s responses to users. Compare the outputs over a few days of real traffic. This catches issues that synthetic testing misses — unusual query patterns, edge cases in production data, load-dependent behavior.
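The shadow pattern is small enough to sketch in full. In a real system the shadow call would be asynchronous and best-effort so it cannot add latency or errors to the user-facing path; the synchronous version below is a simplification:

```python
def serve_with_shadow(query: str, old_model, new_model, log: list) -> str:
    """Serve the old model's answer; also run the new model and log
    both responses for offline comparison."""
    primary = new_model and old_model(query)  # old model is still the source of truth
    shadow = new_model(query)                 # fire-and-forget in production (assumption)
    log.append({"query": query, "served": primary, "shadow": shadow})
    return primary

shadow_log: list[dict] = []
answer = serve_with_shadow("What is our refund policy?",
                           old_model=lambda q: "old answer",
                           new_model=lambda q: "new answer",
                           log=shadow_log)
```

The key property: users never see the shadow output. If the new model misbehaves on some weird production query, that misbehavior lands in a log file, not in a support ticket.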

Step 5: Canary deploy. Send 5–10% of production traffic to the new model. Monitor error rates, latency, user feedback, and downstream system health. If anything degrades, roll back. If everything looks stable after 24–48 hours, increase the percentage.
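One detail worth getting right in the canary: bucket by user, not by request, so a given user sees a consistent model rather than flip-flopping between the two. A deterministic-hashing sketch:

```python
import hashlib

def pick_model(user_id: str, canary_fraction: float = 0.05) -> str:
    """Stable per-user bucketing: the same user always lands in the
    same bucket, so their experience is consistent during the canary."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000
    return "new_model" if bucket < canary_fraction * 10_000 else "old_model"
```

Ramping the canary is then just raising `canary_fraction` from 0.05 toward 1.0, and rolling back is setting it to 0.0; no redeploy required for either.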

Step 6: Cut over. Move 100% of traffic to the new model. Keep the old model configuration available for immediate rollback. Monitor closely for a week.

This process takes 1–2 weeks for a straightforward model upgrade. It takes longer if the model change involves a provider switch or a significant capability difference. This is not slow — this is responsible.

The shortcuts that hurt

“The new model is just a minor version bump.” Minor version bumps can still change outputs. GPT-4-0613 and GPT-4-1106 are both “GPT-4” and they behave differently. Test every change. There is no safe shortcut.

“We’ll just watch the dashboards after deploying.” By the time the dashboards show a problem, your users have already seen it. Shadow testing and canary deploys exist specifically so your users don’t have to be your test suite.

“Our prompts are model-agnostic.” No they are not. Prompts are tuned — consciously or unconsciously — to the behavior of the model they were written for. A prompt that works well with Claude might not work well with GPT-4, and vice versa. Model-agnostic prompts are a useful aspiration and a dangerous assumption.

“We can always roll back.” Can you? How fast? Is the rollback automated? Have you tested it? A rollback plan that exists only as an idea in someone’s head is not a rollback plan. Script it. Test it. Time it.
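A tested rollback means rollback is a config flip, not a deploy. One way to structure it is to keep the old model's *entire* configuration, not just the model string, addressable by name, since the prompt version and timeouts were tuned to that model too. The model ids and prompt versions below are illustrative:

```python
# Each named config bundles everything tuned to that model.
CONFIGS = {
    "stable": {"model": "gpt-4-0613", "timeout_s": 30, "prompt_version": "v12"},
    "canary": {"model": "gpt-4o",     "timeout_s": 20, "prompt_version": "v13"},
}

ACTIVE = {"name": "canary"}

def rollback() -> dict:
    """Flip back to the stable config. In production this would be a
    feature-flag or config-service update, not an in-process dict."""
    ACTIVE["name"] = "stable"
    return CONFIGS[ACTIVE["name"]]
```

Time how long this takes end to end, including the person noticing the problem and deciding to pull the trigger. If the answer is "we'd have to cut a new release", you do not have a rollback plan; you have a second migration.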

The organizational discipline

Model migrations should have owners. Not the whole team — one person who is responsible for the migration plan, the eval comparison, the shadow test, and the cut-over decision. This person is the equivalent of the DBA who runs the schema migration. They do not need to be the most senior engineer. They need to be the most careful one.

Model changes should have calendar entries. Not “we’ll switch sometime next week.” A specific date, with a specific rollback window, and a specific person on-call for the first 48 hours. Same as a database migration.

Model changes should have runbooks. What to check. What thresholds to watch. When to roll back. Who to notify. This document takes an hour to write and saves a day of chaos when something goes wrong.

The heuristic

Treat every model change — version bump, provider switch, or capability upgrade — with the same rigor as a database schema migration. Eval, compare, shadow, canary, cut over. If you would not change your database schema on a Friday afternoon with no rollback plan, do not change your model that way either.

tl;dr

The pattern. Teams treat model changes as config updates and discover on Monday that the new model’s slightly different formatting broke their parsing code, shifted their eval scores, and generated 40 support tickets over the weekend. The fix. Run evals, diff production outputs, shadow test in production, canary deploy at 5–10%, then cut over — with a named owner, a calendar entry, and a tested rollback plan. The outcome. Model upgrades become routine, regressions get caught before users see them, and the team builds the confidence to migrate often instead of deferring until a model is deprecated.

