№ 31 · Jul 11, 2025 · 9 min read

AI teams need on-call. Not optional.

If your AI system is in production and nobody is on-call for it, you have made a decision. You have decided that your users will be the ones who discover failures. That is a choice you are making — you should at least make it consciously.

Most AI teams we work with don’t have on-call rotations. They have a Slack channel. Maybe a dashboard someone checks on Mondays. When something goes wrong, the signal path is: user notices bad output, user complains to support, support files a ticket, ticket gets triaged, engineer looks at it 3 days later, engineer discovers the model has been hallucinating since Thursday.

That is not an operational posture. That is hope.

AI failures are quiet

Traditional software fails loudly. A null pointer throws an exception. A database timeout returns a 500. A broken deployment triggers a health check failure. Your monitoring catches these because they are binary — the system either works or it doesn’t.

AI systems fail quietly. The model doesn’t crash. It returns a 200. The response looks plausible. It’s just wrong. Your user gets a confident answer that cites a document that was deleted 6 weeks ago, or a classification that’s subtly shifted because the input distribution changed, or a summary that omits the most important paragraph.

No alert fires. No error log gets written. The system is running perfectly — it’s just producing garbage.

This is why traditional monitoring is necessary but not sufficient. You need health checks and latency tracking and error rate dashboards, yes. But you also need monitoring that understands the outputs.

What on-call for AI actually means

On-call for AI systems is not the same as on-call for a web service. You’re watching for different things.

Output distribution shifts. If your classification model usually returns category A 40% of the time and it suddenly starts returning category A 80% of the time, something changed. Maybe the model updated. Maybe the input distribution shifted. Either way, a human should look at it.
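A check like that can be a few lines. This is a sketch, not a prescription: `category_rate_shift`, the baseline dict, and the 0.2 threshold are all illustrative values you would tune to your own system.

```python
from collections import Counter

def category_rate_shift(outputs: list[str], baseline: dict[str, float],
                        threshold: float = 0.2) -> list[str]:
    """Flag categories whose observed rate deviates from the baseline
    rate by more than `threshold` (absolute difference)."""
    counts = Counter(outputs)
    total = len(outputs)
    flagged = []
    for category, expected in baseline.items():
        observed = counts.get(category, 0) / total
        if abs(observed - expected) > threshold:
            flagged.append(category)
    return flagged

# Baseline says "A" appears 40% of the time; today it is 80%.
alerts = category_rate_shift(["A"] * 80 + ["B"] * 20,
                             {"A": 0.40, "B": 0.60})
```

When any category comes back flagged, that is the "a human should look at it" moment — the check tells you something changed, not why.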

Drift detection. Compare today’s outputs to last week’s outputs on similar inputs. If the distribution is moving, you want to know before your users do.

Latency anomalies. LLM latency is noisy, but it’s not random. If your p95 doubles overnight, either the provider is having issues or your prompts got longer or your retrieval is returning more context. All of these matter.
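The "p95 doubles overnight" rule is mechanical enough to automate. A minimal sketch, assuming you collect per-request latencies and keep a baseline p95 from a healthy period (the nearest-rank percentile and the 2x factor are illustrative choices):

```python
import math

def p95(samples: list[float]) -> float:
    """95th-percentile latency via the nearest-rank method."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered)) - 1
    return ordered[rank]

def latency_alert(recent: list[float], baseline_p95: float,
                  factor: float = 2.0) -> bool:
    """Page when the recent window's p95 exceeds `factor` x baseline."""
    return p95(recent) > factor * baseline_p95
```

The alert only says the tail got slower; distinguishing provider issues from longer prompts from fatter retrieval context is the on-call engineer's job.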

Cost spikes. A bug in your chunking logic can 10x your token usage overnight. A retry loop that doesn’t back off can burn through your API budget in hours. If you’re not alerting on cost, you will get a surprise invoice.
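The retry-loop failure mode is worth a concrete sketch. A retry loop without backoff is exactly the bug that burns through an API budget in hours; adding exponential backoff with jitter caps the damage. The function below is a generic illustration, not a specific client library's API:

```python
import random
import time

def call_with_backoff(call, max_retries: int = 5, base_delay: float = 1.0):
    """Retry a flaky call with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries — surface the error, don't loop forever
            # Sleep ~1s, 2s, 4s, ... with jitter so retries don't synchronize.
            time.sleep(base_delay * (2 ** attempt + random.random()))
```

The hard cap on retries matters as much as the backoff: a bounded loop fails loudly after five attempts instead of silently spending money all night.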

Eval regression. Run your eval suite on a schedule — daily at minimum. If your accuracy on the held-out set drops below your threshold, page someone. Don’t wait for the weekly review.

The “but we’re a small team” objection

Every AI team we’ve talked to about on-call has the same response: we’re too small. We can’t afford a rotation. We only have 3 engineers.

You have 3 engineers and a production system that serves users. Traditional engineering teams your size have on-call. The AI team doesn’t get an exemption because the system is newer or less understood. If anything, the opposite is true — less understood systems need more operational rigor, not less.

The rotation doesn’t need to be heavy. Start with:

  • One person is primary each week. They carry a phone.
  • Alerts fire for: eval regression below threshold, latency p95 above target, cost anomaly above 2x daily average, output distribution shift above threshold.
  • Response expectation: acknowledge within 30 minutes during business hours, 2 hours outside.
  • Escalation path: if primary can’t resolve, they pull in the model owner.

That’s it. Four alert types. One person per week. Acknowledgment SLAs. This is not a large operational burden. It is the minimum bar for running a production system.
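The whole policy fits in one screen of config. The structure below mirrors the four alert types and SLAs above; every literal value is a placeholder to replace with your own thresholds:

```python
# Illustrative on-call policy — every value here is a placeholder.
ONCALL_POLICY = {
    "rotation": "one primary per week, carries the pager",
    "alerts": {
        "eval_regression":    "accuracy below threshold on scheduled run",
        "latency":            "p95 above target",
        "cost_anomaly":       "daily spend above 2x trailing average",
        "distribution_shift": "output distribution shift above threshold",
    },
    "ack_sla": {"business_hours": "30 minutes", "after_hours": "2 hours"},
    "escalation": ["primary", "model owner"],
}
```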

What to monitor — concretely

Here is the monitoring stack we recommend for most AI systems:

Tier 1 — page someone.

  • Eval suite accuracy drops below threshold (run daily).
  • Latency p95 exceeds 2x baseline for 15 minutes.
  • Error rate exceeds 5% for 10 minutes.
  • Daily cost exceeds 2x trailing 7-day average.
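The cost rule in that last bullet is the easiest Tier 1 check to implement, and one of the highest-value. A sketch, with the 2x factor as an illustrative default:

```python
def cost_anomaly(today_usd: float, trailing_usd: list[float],
                 factor: float = 2.0) -> bool:
    """Tier 1: page when today's spend exceeds `factor` times the
    trailing (e.g. 7-day) average."""
    average = sum(trailing_usd) / len(trailing_usd)
    return today_usd > factor * average
```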

Tier 2 — ticket, investigate within 24 hours.

  • Output distribution shift detected (KL divergence above threshold).
  • New failure mode appears in error logs (novel error string).
  • Retrieval hit rate drops below baseline (for RAG systems).
  • User feedback negative rate increases above baseline.
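The KL-divergence check from the first Tier 2 bullet can be computed directly from two category histograms. This sketch assumes you bucket outputs into discrete categories; the epsilon guard for categories the baseline never produced is a common practical choice, not part of the definition:

```python
import math

def kl_divergence(p: dict[str, float], q: dict[str, float],
                  eps: float = 1e-9) -> float:
    """D_KL(P || Q): how far this week's output distribution P has
    drifted from the baseline distribution Q. Zero means no drift."""
    return sum(pv * math.log(pv / max(q.get(cat, 0.0), eps))
               for cat, pv in p.items() if pv > 0)
```

Compare the result against a threshold you calibrate on known-good weeks; the divergence itself has no universal "bad" value.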

Tier 3 — review weekly.

  • Model provider changelog (did the model update?).
  • Input distribution trends (are users asking different questions?).
  • Cost trends (are we drifting up?).
  • Eval suite coverage (are we testing the right things?).

The specific thresholds depend on your system. But the structure doesn’t. You need all three tiers, and you need them before your users start complaining.

The eval suite is your smoke detector

The most important piece of the monitoring stack is the eval suite running on a schedule. Everything else — latency, cost, error rates — those are infrastructure metrics. They tell you the system is running. They don’t tell you the system is right.

Your eval suite tells you the system is right. It is the only thing in your monitoring stack that checks the quality of the outputs. If your eval suite is only running in CI — only running when someone pushes a code change — you are missing the most important class of failures: the ones that happen when nothing in your code changes.

Model provider updates. Retrieval index drift. Input distribution shifts. These all degrade quality without any deployment. Your CI pipeline doesn’t catch them because there’s nothing to trigger the pipeline.

Run your evals daily. On production data if possible, on a representative sample if not. Compare against your baseline. Alert when it drops.
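The scheduled run itself is small. A sketch of the daily job (invoked by cron or any scheduler), assuming `cases` is a list of (input, expected) pairs and `predict` is your system under test; the sample size and 3-point tolerance are illustrative:

```python
import random

def run_daily_evals(cases, predict, baseline: float,
                    tolerance: float = 0.03, sample_size: int = 200):
    """Run the eval suite on a representative sample and decide whether
    to page. Returns (accuracy, should_page)."""
    sample = random.sample(cases, min(sample_size, len(cases)))
    correct = sum(1 for x, expected in sample if predict(x) == expected)
    accuracy = correct / len(sample)
    return accuracy, accuracy < baseline - tolerance
```

The key design choice is that the trigger is the clock, not a deployment — which is exactly what catches the failures CI never sees.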

The organizational problem

The deeper issue is organizational. Most companies treat AI systems as a special category — not quite software, not quite data, something new that doesn’t fit existing operational patterns. This leads to operational gaps.

The infrastructure team doesn’t own the AI system because “that’s the ML team’s thing.” The ML team doesn’t do ops because “that’s infrastructure’s job.” Nobody is on-call because nobody owns the full stack.

The fix is simple in concept and hard in execution: someone owns the production AI system end-to-end. That person — or that team — is on-call for it. They are responsible for the model and the infrastructure and the pipeline and the outputs. They don’t get to say “the model is fine, it must be an infrastructure issue” or “the infrastructure is fine, it must be a model issue.” They own both.

This is the same pattern that DevOps solved a decade ago for traditional software. You build it, you run it, you get paged for it. AI systems don’t get a special exemption.

The heuristic

If your AI system is in production and nobody gets paged when it fails, you don’t have a production system. You have a demo that happens to be serving users.

The bar: run evals daily, alert on regressions, have one person on-call per week, and treat output quality as a production metric — not a research metric. Do this before you build the next feature.

tl;dr

The pattern. AI systems fail silently — returning a 200 with a plausible but wrong answer — and the signal path from failure to fix runs through user complaints, support tickets, and a three-day triage queue.

The fix. Stand up a four-alert on-call rotation (eval regression, latency spike, cost anomaly, output distribution shift) with one primary per week before you ship anything to users.

The outcome. Output quality becomes a production metric you catch internally instead of a research metric your users discover first.

