№ 42 ops Dec 19, 2025 · 9 min read

Monitoring AI systems is not monitoring APIs

HTTP 200 does not mean the answer was right. AI monitoring requires output quality metrics, not just uptime and latency.


Your AI system is monitored. You have dashboards. Uptime: 99.9%. p95 latency: 2.3 seconds. Error rate: 0.1%. Everything is green. Everything looks healthy.

Your users are getting wrong answers. They have been getting wrong answers for three days. Your monitoring did not catch it, because your monitoring is not monitoring the right thing.

The gap

Traditional API monitoring answers one question: is the system running? Uptime, latency, error rate, throughput — these tell you whether the service is available and responsive. For a CRUD API, this is sufficient. If the service is up and returning 200s, it is probably working correctly.

AI systems break this assumption. An AI system can be 100% available, returning 200s with sub-second latency, and be 100% wrong. The model is running. The API is responding. The answers are garbage.

This happens more often than teams expect. A retrieval index gets corrupted — the system returns confident, well-formed answers based on the wrong documents. A prompt change introduces a subtle regression — the system answers most queries correctly but consistently fails on a specific category. A model update changes behavior in ways that are hard to detect from individual responses but obvious in aggregate.

Your Datadog dashboard will not catch any of these. It will remain green while your users lose trust.

What to monitor

AI monitoring requires a different set of metrics. Not instead of traditional monitoring — in addition to it. You still need uptime and latency. But you also need metrics that approximate output quality.

Output distribution tracking. Monitor the statistical properties of your outputs over time. Average response length. Vocabulary diversity. Frequency of refusal responses (“I cannot answer that”). Frequency of hedging language (“I’m not sure, but…”).

These are not direct measures of quality. They are proxies — and useful ones. If your average response length suddenly drops by 40%, something changed. If your refusal rate spikes from 2% to 15%, something is wrong. If every response starts with the same phrase, something is broken.

Set baselines during a period of known-good behavior. Alert on deviations beyond 2 standard deviations. The alert will not tell you what is wrong, but it will tell you something is wrong — which is infinitely better than finding out from a customer escalation.
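A minimal sketch of that baseline-and-deviation check. The window values and the metric (response length in tokens) are illustrative; the same function works for refusal rate or any other distribution metric:

```python
from statistics import mean, stdev

def deviation_alert(baseline: list[float], current: float, n_sigma: float = 2.0) -> bool:
    """Return True when `current` deviates from the known-good baseline
    window by more than `n_sigma` sample standard deviations."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    return abs(current - mu) > n_sigma * sigma

# Baseline: average response lengths (tokens) from a known-good week.
baseline = [412, 398, 420, 405, 415, 408, 411]
print(deviation_alert(baseline, 409))  # a typical day -> False, no alert
print(deviation_alert(baseline, 240))  # the 40% drop -> True, alert
```

The same check runs per metric, per day; the 2-sigma threshold is the generous starting point the alerting section below recommends tightening over time.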

Retrieval quality metrics. If you are running a RAG system, monitor the retrieval layer independently. Track the number of chunks retrieved per query, the similarity scores of retrieved chunks, and the percentage of queries that retrieve zero results.

A drop in average similarity score means your retrieval is returning less-relevant documents. A spike in zero-result queries means your index is missing coverage. These are leading indicators — they degrade before the user-visible output does.
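All three retrieval metrics can be aggregated from a window of per-query similarity scores. A sketch, where each inner list holds the scores of the chunks retrieved for one query (the function name and score values are illustrative, not tied to any particular vector store):

```python
from statistics import mean

def retrieval_health(batches: list[list[float]]) -> dict:
    """Aggregate retrieval-layer metrics over a window of queries.
    An empty inner list represents a zero-result query."""
    scored = [s for batch in batches for s in batch]
    return {
        "avg_chunks_per_query": mean(len(b) for b in batches),
        "avg_similarity": mean(scored) if scored else 0.0,
        "zero_result_rate": sum(1 for b in batches if not b) / len(batches),
    }

window = [[0.85, 0.81, 0.78], [0.72, 0.70], [], [0.88, 0.84, 0.80]]
print(retrieval_health(window))
```

Emit these three numbers per window to your metrics backend and alert on them like any other time series.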

Confidence and uncertainty. If your system produces confidence scores — through calibrated probabilities, log probabilities, or a separate scoring step — track them. A decline in average confidence suggests the system is seeing queries it is less equipped to handle, or that the underlying data has drifted.

Not every system has native confidence scores. But you can add them. A simple approach: after generating a response, ask a second model (or the same model with a different prompt) whether the response answers the question. Track the agreement rate. A drop in agreement is a signal.
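The agreement-rate idea can be sketched with the judge abstracted as a callable. The stub judge below is purely illustrative; in production it would be a second-model call:

```python
def agreement_rate(pairs: list[tuple[str, str]], judge) -> float:
    """Fraction of (question, response) pairs the judge accepts.
    `judge` is any callable returning True/False."""
    verdicts = [judge(q, r) for q, r in pairs]
    return sum(verdicts) / len(verdicts)

# Stub judge for illustration: accepts responses containing the question's
# last key term. Replace with a model call that asks "does this response
# answer the question?" and parses the verdict.
def stub_judge(question: str, response: str) -> bool:
    return question.split()[-1].rstrip("?").lower() in response.lower()

pairs = [
    ("What is the capital of France?", "The capital is Paris, France."),
    ("What year did the project start?", "I cannot answer that."),
]
print(agreement_rate(pairs, stub_judge))  # -> 0.5
```

Track the rate over time; the absolute number matters less than a drop from its own baseline.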

Cost per query. Monitor what each query costs — in API tokens, in compute, in dollars. Cost is a surprisingly good proxy for behavioral changes. If cost per query increases, the model is producing longer outputs or the retrieval is stuffing more context into the prompt. If cost decreases, outputs are getting shorter — which might mean the model is being less thorough.

Cost monitoring also catches runaway spending. A prompt change that triggers verbose reasoning chains can 3x your API bill before anyone notices. If you are monitoring cost per query with alerts, you catch it in hours, not at month-end.
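Computing cost per query from token counts takes one function. The per-1k-token rates below are placeholders, not any provider's actual pricing:

```python
def query_cost(prompt_tokens: int, completion_tokens: int,
               in_rate: float, out_rate: float) -> float:
    """Dollar cost of one query given per-1k-token rates.
    Substitute your provider's actual pricing for the rates."""
    return prompt_tokens / 1000 * in_rate + completion_tokens / 1000 * out_rate

# Hypothetical rates: $0.003 per 1k input tokens, $0.015 per 1k output tokens.
cost = query_cost(1200, 400, 0.003, 0.015)
print(f"${cost:.4f} per query")  # -> $0.0096 per query
```

Feed this into the same baseline-deviation alerting as the other metrics; a sustained shift in either direction is worth investigating.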

Periodic eval runs. The most reliable quality signal: run your eval suite against production on a schedule. Daily, if you can afford it. Weekly at minimum.

Take a sample of production queries, run them through the system, and score the outputs against your golden set or with LLM-as-judge. Track the score over time. If it drops, investigate.

This is not a substitute for real-time monitoring. Eval runs are lagging indicators — they tell you about yesterday’s quality, not right now. But they are the most accurate quality signal you have, and they catch slow degradation that proxy metrics miss.
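The scoring step can be as simple as exact match against the golden set; swap in an LLM-as-judge call for free-form answers. A sketch with hypothetical queries:

```python
def eval_score(outputs: dict[str, str], golden: dict[str, str]) -> float:
    """Fraction of golden-set queries whose production output matches the
    expected answer (normalized exact match). For free-form answers,
    replace the comparison with a judge-model call."""
    hits = sum(
        1 for query, expected in golden.items()
        if outputs.get(query, "").strip().lower() == expected.strip().lower()
    )
    return hits / len(golden)

golden = {"refund window?": "30 days", "support email?": "help@example.com"}
outputs = {"refund window?": "30 days", "support email?": "support@example.com"}
print(eval_score(outputs, golden))  # -> 0.5
```

Run it on a schedule, store the score as a time series, and alert when it drops below the trailing baseline.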

The dashboard

Here is what an AI monitoring dashboard should include, beyond the standard operational metrics:

  • Response length distribution (histogram, with 7-day rolling baseline)
  • Refusal rate (time series)
  • Retrieval similarity score distribution (if RAG)
  • Zero-result retrieval rate (if RAG)
  • Cost per query (p50, p90, p99)
  • Eval score (latest run, trend over last 30 days)
  • Output diversity score (unique n-grams as a fraction of total n-grams)

Each of these should have an alert threshold. Start generous — you do not want alert fatigue on day one. Tighten the thresholds as you build intuition about what normal looks like.
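The output diversity score from the list above (unique n-grams over total n-grams) takes only a few lines; word-level trigrams are one reasonable choice:

```python
def diversity_score(text: str, n: int = 3) -> float:
    """Unique word-level n-grams as a fraction of total n-grams.
    Approaches 0 when the model repeats itself."""
    words = text.split()
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not grams:
        return 0.0
    return len(set(grams)) / len(grams)

print(diversity_score("the cat sat on the mat"))   # all trigrams unique -> 1.0
print(diversity_score("yes yes yes yes yes yes"))  # one repeated trigram -> 0.25
```

Computed over a rolling sample of responses, a falling score is an early sign of degenerate, repetitive output.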

The incident you will catch

Here is a real pattern we have seen: a team updated their embedding model as part of a routine dependency upgrade. The new model produced embeddings with a slightly different similarity-score distribution. The retrieval index was rebuilt, but the scores shifted — documents that previously scored 0.85 now scored 0.72. The retrieval was still returning results, so no errors were thrown. But the results were less relevant. Answer quality degraded gradually over two weeks.

With traditional monitoring, this is invisible. With retrieval quality monitoring, the similarity score drop is caught within hours.

The heuristic

If your AI monitoring dashboard has the same metrics as your API monitoring dashboard, you are not monitoring your AI system. You are monitoring the container it runs in. Add output distribution tracking, retrieval quality metrics, cost per query, and periodic eval runs. The system can be up and wrong. Your monitoring should know the difference.

tl;dr

The pattern. Teams monitor AI systems the same way they monitor APIs — uptime, latency, error rate — which stays green while the model returns wrong answers for days because HTTP 200 says nothing about whether the response was correct. The fix. Add output distribution tracking, retrieval similarity scores, cost per query, and scheduled eval runs against a golden set on top of your standard operational metrics. The outcome. Silent quality regressions — like a corrupted retrieval index or a prompt change that breaks a query category — get caught in hours instead of via customer escalation.

