№ 33 evals Aug 08, 2025 · 10 min read

Your A/B test is lying because your baseline is moving

A/B testing AI features is harder than A/B testing traditional features because the model itself changes. Your control group is not constant. Your experiment is corrupted.
You ran an A/B test on your AI feature. The treatment group saw a 12% improvement in task completion. The test ran for 4 weeks. You shipped it. Congratulations — your results are probably wrong.

Here’s why. During those 4 weeks, your model provider pushed two updates. Your retrieval index was re-built twice. Your prompt was edited once by an engineer who “just fixed a typo.” The system your users experienced in week 1 was not the same system they experienced in week 4.

Your control group was not constant. Your treatment was not stable. Your A/B test measured something, but it wasn’t what you think it measured.

The fundamental problem

A/B testing requires a stable treatment and a stable control. You change one thing — the treatment — and measure the difference. If both sides are changing simultaneously, you can’t attribute the outcome to the treatment.

Traditional software features are stable once deployed. A new checkout flow doesn’t morph over time. The button stays blue. The copy stays the same. The logic doesn’t drift.

AI features drift by default. The model changes when the provider updates it. The retrieval results change when the index is rebuilt. The prompt changes when someone edits it. The guardrails change when the safety team updates the rules. Even the input distribution changes as users adapt their behavior to the system.

In a 4-week A/B test, you might see:

  • 2 model updates from your provider (often unannounced for minor versions).
  • 1-2 retrieval index rebuilds as new documents are ingested.
  • 1-3 prompt changes as the team iterates.
  • Continuous input distribution shift as users learn what works.

Each of these changes affects both the control and the treatment, but not equally. A model update might improve the treatment while degrading the control — or vice versa. You can’t tell, because you didn’t isolate the variable.

Version everything

The first fix is version control for your entire AI stack, not just your code.

Model version. Pin the model version for both control and treatment. If your provider doesn’t support version pinning — if you’re calling gpt-4o instead of gpt-4o-2024-08-06 — you are running an experiment where the treatment changes without your knowledge. Pin the version. If the provider pushes a breaking update, that’s a reason to restart the experiment, not to let it continue.
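One cheap safeguard is to refuse to start an experiment with a floating alias at all. A minimal sketch (the helper name and date-suffix convention are assumptions; adapt the pattern to your provider's naming scheme):

```python
import re

# Pinned snapshots carry a date suffix (e.g. gpt-4o-2024-08-06);
# a bare alias (gpt-4o) can be silently repointed by the provider.
PINNED = re.compile(r"-\d{4}-\d{2}-\d{2}$")

def assert_pinned(model_name: str) -> str:
    """Raise if the model name is a floating alias rather than a dated snapshot."""
    if not PINNED.search(model_name):
        raise ValueError(
            f"{model_name!r} is a floating alias; pin a dated snapshot "
            "before starting the experiment."
        )
    return model_name

assert_pinned("gpt-4o-2024-08-06")   # passes
# assert_pinned("gpt-4o")            # raises ValueError
```

Run this check when the experiment config is loaded, so an unpinned model fails loudly before any traffic is split.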

Prompt version. Your prompt is not code that lives in a repo and gets deployed through CI. It should be, but for most teams it isn’t. During an A/B test, freeze the prompt. No edits. No “small fixes.” If someone changes the prompt, the experiment is contaminated.

Retrieval configuration. Freeze the retrieval config: the embedding model, the chunk size, the reranker, the number of results. If your index rebuilds during the experiment, rebuild both control and treatment simultaneously from the same snapshot.

Guardrails and post-processing. Version your guardrails configuration. A new content filter that blocks certain outputs will change your completion rate, which will change your metrics, which will corrupt your experiment.

This is a lot of things to freeze. That’s the point. AI systems have more moving parts than traditional features, which means A/B tests require more discipline, not less.
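One way to make the freeze enforceable is a version manifest: hash every moving part into a single fingerprint and attach it to each request log. A sketch, with illustrative field names and config strings (not a standard schema):

```python
import hashlib
import json
from dataclasses import dataclass, asdict

def sha(text: str) -> str:
    """Short content hash for a config artifact."""
    return hashlib.sha256(text.encode()).hexdigest()[:12]

@dataclass(frozen=True)
class SystemVersion:
    """One immutable record of every moving part in the AI stack."""
    model: str              # pinned model snapshot, e.g. "gpt-4o-2024-08-06"
    prompt_sha: str         # hash of the exact prompt text
    retrieval_sha: str      # hash of embedding model + chunking + reranker config
    guardrails_sha: str     # hash of the guardrails/post-processing config

    def fingerprint(self) -> str:
        """Stable short hash; log it with every request so each interaction
        can be attributed to an exact system version."""
        blob = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()[:12]

control = SystemVersion(
    model="gpt-4o-2024-08-06",
    prompt_sha=sha("You are a helpful assistant..."),
    retrieval_sha=sha("text-embedding-3-small|chunk=512|rerank=top5"),
    guardrails_sha=sha("filters-v3.yaml"),
)
```

If any component changes mid-experiment, the fingerprint changes, and the contamination shows up in your logs instead of staying invisible.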

Run shorter, wider

Traditional A/B tests run for 2-4 weeks to accumulate enough samples to reach statistical significance. For AI features, this is too long. Too many things change in 4 weeks.

The fix: run shorter experiments with larger populations. Instead of 4 weeks at 5% traffic, run 1 week at 20% traffic. You get the same sample size in a quarter of the time, and you reduce the window for confounders.
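The arithmetic is straightforward (daily traffic figure is illustrative):

```python
# Same sample size, a quarter of the confounder window.
daily_traffic = 100_000                   # eligible users per day (assumed)

narrow_long = daily_traffic * 0.05 * 28   # 5% traffic for 4 weeks
wide_short  = daily_traffic * 0.20 * 7    # 20% traffic for 1 week

assert narrow_long == wide_short == 140_000
```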

This isn’t always possible. Some metrics — retention, conversion over time — require longer observation windows. For those, you need a different approach: cohort analysis with versioned snapshots. Group users by the exact system version they experienced, not just by the time window. A user who experienced model version A with prompt version 3 is in a different cohort than a user who experienced model version B with prompt version 3, even if they’re both in the “treatment” group.

This is more analytical work. It’s also more honest.
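In code, versioned cohort analysis is just a change of grouping key: group by the exact system version, not the arm. A sketch over a hypothetical event log (field names and values are illustrative):

```python
from collections import defaultdict

# Each event records which exact system version the user experienced,
# not just their A/B arm.
events = [
    {"user": "u1", "arm": "treatment", "model": "A", "prompt_v": 3, "completed": True},
    {"user": "u2", "arm": "treatment", "model": "B", "prompt_v": 3, "completed": False},
    {"user": "u3", "arm": "treatment", "model": "A", "prompt_v": 3, "completed": True},
]

# Users under model A and model B land in different cohorts,
# even though both sit in the "treatment" arm.
cohorts = defaultdict(list)
for e in events:
    cohorts[(e["arm"], e["model"], e["prompt_v"])].append(e["completed"])

for key, outcomes in cohorts.items():
    rate = sum(outcomes) / len(outcomes)
    print(key, f"completion={rate:.0%}")
```

Comparing completion rates within a cohort is comparing like with like; comparing across the raw "treatment" arm averages over system versions you never intended to mix.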

Offline evals gate online experiments

Here’s the pattern we recommend: offline evaluation comes before A/B testing. Not instead of it — before it.

Your eval suite is your first gate. Run the new prompt, the new model, the new retrieval config against your eval dataset. Compare accuracy, relevance, latency, and cost against your baseline. If the offline eval doesn’t show improvement, don’t bother with the A/B test. You’re not going to find a signal in production that you can’t find in evaluation.

If the offline eval does show improvement, then you’ve earned the right to run an A/B test. But the A/B test is now answering a narrower question: does the improvement in eval translate to an improvement in user behavior? That’s a much cleaner experiment.
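The gate itself can be a few lines. A minimal sketch, with assumed metric names and thresholds (tune both to your product):

```python
def passes_gate(baseline: dict, candidate: dict,
                min_lift: float = 0.02,
                max_latency_regression: float = 0.10) -> bool:
    """Candidate earns an A/B test only if it beats baseline offline:
    quality must improve by at least min_lift, and p95 latency must not
    regress by more than max_latency_regression."""
    quality_lift = candidate["accuracy"] - baseline["accuracy"]
    latency_regression = (candidate["p95_latency_s"] / baseline["p95_latency_s"]) - 1
    return quality_lift >= min_lift and latency_regression <= max_latency_regression

baseline  = {"accuracy": 0.81, "p95_latency_s": 2.4}
candidate = {"accuracy": 0.85, "p95_latency_s": 2.5}
passes_gate(baseline, candidate)  # lift 0.04, latency regression ~4% -> True
```

Wire this into CI so a candidate that fails the gate never reaches the experiment framework at all.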

The eval suite also gives you a stable baseline. Your A/B test baseline drifts because the production system drifts. Your eval baseline is fixed — same inputs, same expected outputs, measured on every version. If your eval score drops during the A/B test, you know the system changed. You can decide whether to continue or restart.

The metrics problem

Even if you fix the versioning problem, AI A/B tests have a metrics problem. What are you measuring?

Traditional A/B tests measure user behavior: clicks, conversions, time on page. These are well-understood metrics with well-understood statistical properties.

AI feature metrics are harder. “Answer quality” is not a metric you can measure directly from user behavior. A user who gets a wrong answer might not know it’s wrong — they’ll click through, seem satisfied, and only discover the problem later. A user who gets a correct but verbose answer might bounce — looking like a negative signal when the system actually worked.

Proxy metrics are necessary but treacherous. Common ones:

  • Task completion rate. Did the user finish what they started? But completion doesn’t mean the answer was right.
  • Reformulation rate. Did the user rephrase their query? High reformulation might mean the system is bad, or it might mean the user is exploring.
  • Thumbs up/down. Direct feedback, but biased toward strong opinions and heavily affected by UI placement.
  • Time to resolution. How long did it take the user to get what they needed? But you’re measuring a noisy signal over a long time horizon.

None of these are great. The best approach is to combine multiple metrics and look for directional agreement. If task completion is up, reformulation is down, and feedback is positive — you probably have a real improvement. If the signals disagree, you don’t have a clear result and you shouldn’t ship based on the A/B test alone.
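The directional-agreement check can be mechanical. A sketch, where the sign convention encodes which direction is "good" for each proxy metric (metric names and deltas are illustrative):

```python
def directional_agreement(deltas: dict) -> bool:
    """True only if every proxy metric moved in its 'good' direction.
    deltas are treatment minus control."""
    good_direction = {
        "task_completion": +1,   # up is good
        "reformulation": -1,     # down is good
        "thumbs_up": +1,         # up is good
    }
    return all(good_direction[m] * d > 0 for m, d in deltas.items())

deltas = {"task_completion": +0.03, "reformulation": -0.05, "thumbs_up": +0.01}
directional_agreement(deltas)  # all three agree -> True
```

A stricter version would also require each delta to clear a per-metric significance threshold before counting it as agreement; this sketch only checks signs.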

What to do when you can’t A/B test

Sometimes A/B testing is impractical. Your user base is too small. The feature is too niche. The metric requires too long an observation window.

In those cases, lean on offline evaluation and qualitative assessment:

  • Run your eval suite on the new version. Compare to baseline.
  • Have domain experts review a sample of outputs. Rate them blind — don’t tell them which version produced which output.
  • Deploy to a small group of internal users first. Collect structured feedback.
  • Ship with a kill switch and monitor closely for the first week.
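The blind-review step above is easy to get subtly wrong (reviewers infer the version from ordering). A sketch of the interleave-and-strip pattern, with a hypothetical helper name:

```python
import random

def blind_sample(outputs_a: list, outputs_b: list, seed: int = 0):
    """Interleave outputs from both versions in random order and strip
    the version label. The key stays with the coordinator until all
    ratings are collected."""
    labeled = [("A", o) for o in outputs_a] + [("B", o) for o in outputs_b]
    random.Random(seed).shuffle(labeled)
    key = [version for version, _ in labeled]     # held back
    blinded = [output for _, output in labeled]   # shown to reviewers
    return blinded, key

blinded, key = blind_sample(["a1", "a2"], ["b1", "b2"])
```

A fixed seed makes the shuffle reproducible, so the key can be regenerated if it is lost before ratings come back.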

This is not as rigorous as a well-run A/B test. But a well-run A/B test on an AI feature is harder than most teams think, and a poorly run A/B test gives you false confidence — which is worse than no data at all.

The heuristic

Before running an A/B test on an AI feature: pin every version (model, prompt, retrieval, guardrails), run offline evals as a gate, and prefer 1 week at high traffic over 4 weeks at low traffic. If you can’t freeze the system for the duration of the experiment, you can’t A/B test it. Use offline evals instead and be honest about the uncertainty.

tl;dr

The pattern. A/B tests on AI features run for four weeks while the model provider pushes updates, the retrieval index rebuilds, and someone edits “just a typo” in the prompt — so the control group is never actually constant and the result measures drift, not the treatment. The fix. Pin every moving part (model version, prompt, retrieval config, guardrails) for the entire experiment window, run offline evals as a gate before starting, and prefer one week at high traffic over four weeks at low traffic to shrink the contamination window. The outcome. You ship changes based on what your treatment actually did instead of what the background noise of a drifting system happened to produce during your experiment.
