Prompt versioning is not optional
If you cannot tell me which prompt was running in production last Thursday at 3pm, you cannot debug a regression. Prompts are code. Version them like code.
Last month a client reported that their AI-powered support system had started giving worse answers. Not catastrophically worse — subtly worse. Longer responses, less specific, occasionally missing the point of the question.
We asked when it started. They were not sure. Sometime in the past two weeks, maybe. We asked what changed. They checked their deploy logs. No code changes. No model changes. No data pipeline changes.
After two hours of investigation, we found it. A developer had tweaked the system prompt — changed three sentences — as part of an unrelated PR. The change was buried in a string literal inside a Python file. It was not called out in the PR description. The reviewer did not notice it. There was no way to correlate the change with the behavior regression because there was no record of which prompt version was running at any given time.
This is the default state of prompt management at most organizations. It is not good.
The current reality
Most teams store prompts as string literals in application code. Sometimes they are in a constants file. Sometimes they are inline in a function. Sometimes they are split across multiple files and assembled at runtime. Occasionally they are in a database, editable via an admin panel, with no version history at all.
The common thread: there is no systematic way to know which prompt was running at a given time, no way to roll back to a previous version without a code deploy, and no way to correlate prompt changes with changes in system behavior.
This would be unacceptable for any other part of the system. You would not run a database migration without tracking which schema version is active. You would not deploy a config change without recording what changed and when. But prompts — which are arguably the most sensitive part of an AI system, the part that most directly controls behavior — get treated as informal text edits.
Why this matters operationally
Prompts are not documentation. They are not comments. They are runtime configuration that directly determines system behavior. A one-word change in a prompt can shift the model’s output distribution in ways that are difficult to predict and difficult to detect without proper monitoring.
When something goes wrong — and it will — you need to answer three questions:
- What prompt was running when the bad output was generated?
- What was the previous prompt, and when did it change?
- Did the behavior change correlate with the prompt change, or is something else going on?
If you cannot answer question 1, you cannot debug the problem. You are guessing. You might fix it by accident. You might make it worse.
The minimum viable approach
You do not need a prompt management platform. You do not need a SaaS tool. You need three things you already have.
1. Prompts live in version-controlled files.
Move your prompts out of application code and into dedicated files. We use YAML, but the format does not matter. What matters is that each prompt is a discrete artifact with its own change history.
```
prompts/
  support-system-prompt.yaml
  summarization-prompt.yaml
  classification-prompt.yaml
```
Each file contains the prompt text, a version identifier, and any metadata that is relevant — when it was last changed, who changed it, why.
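As a sketch, one of these files might look like the following. The field names here are illustrative, not a standard — pick whatever metadata your team will actually maintain:

```yaml
# prompts/support-system-prompt.yaml — illustrative field names
version: "2025-06-12.1"
author: jane@example.com
changelog: "Tightened the escalation instructions after ticket backlog review."
prompt: |
  You are a support assistant.
  Answer concisely and cite the relevant help-center article when possible.
```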
When a developer wants to change a prompt, they change a file. That change shows up in a PR. It gets reviewed. It gets merged. It has a timestamp, an author, and a commit hash. This is not new technology. This is git.
2. Each deploy records the active prompt version.
Your deployment process should capture which prompt versions are active. This can be as simple as logging the git commit hash of the prompts directory at deploy time. Or including the prompt version identifiers in your application’s health check endpoint. Or writing them to a deploy manifest.
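One minimal way to do this, assuming your prompts live in a `prompts/` directory as above, is to hash the directory contents at deploy time and log (or serve) the resulting fingerprint. A sketch:

```python
import hashlib
from pathlib import Path

def prompts_fingerprint(prompts_dir: str = "prompts") -> str:
    """Hash every prompt file's path and contents into one deploy-time
    fingerprint. Any change to any prompt changes the fingerprint."""
    digest = hashlib.sha256()
    for path in sorted(Path(prompts_dir).rglob("*.yaml")):
        digest.update(str(path.name).encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()[:12]

# At deploy time, record the fingerprint — e.g. write it to a deploy
# manifest or return it from your health check endpoint:
# log.info("active_prompts=%s", prompts_fingerprint())
```

Using the git commit hash of the prompts directory works just as well; the point is that the identifier is recorded automatically, not reconstructed from memory.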
The goal is that when someone asks “which prompt was running at 3pm last Thursday,” you can answer in under a minute.
3. Your logs include the prompt identifier.
Every LLM call should log which prompt version was used. Not the full prompt text — that is wasteful and potentially a security concern. Just the version identifier. A hash, a semver string, a timestamp — anything that lets you join your request logs to your prompt history.
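In practice this is a thin wrapper around your LLM client. The sketch below uses Python's standard `logging` module; `client` stands in for whatever API client you actually use, and the stub return is only there so the example runs standalone:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm")

def call_llm(messages: list, prompt_version: str, client=None) -> str:
    """Tag every LLM request log with the prompt version identifier,
    so request logs can be joined to prompt history later."""
    log.info("llm_call prompt_version=%s n_messages=%d",
             prompt_version, len(messages))
    if client is None:  # illustrative stub, not a real API call
        return "(stubbed response)"
    return client.complete(messages)
```

A call then looks like `call_llm(messages, prompt_version="a1b2c3d4e5f6")`, where the identifier is the deploy-time fingerprint or commit hash.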
With these three pieces, you can do something that most teams cannot: correlate prompt changes with behavior changes. When accuracy drops, you check what prompt was running. When you roll out a new prompt, you compare metrics before and after. When a regression occurs, you roll back to the previous version and confirm the regression resolves.
What this enables
Once you have prompt versioning, several practices become possible that are effectively impossible without it.
Prompt rollbacks. When a new prompt makes things worse, you roll back. This takes seconds if your prompts are in config files. It takes a full deploy cycle if they are in application code.
A/B testing. Run two prompt versions simultaneously, route traffic between them, and compare results. This is just feature flagging. Your existing feature flag system can do it — if the prompt version is a config value rather than a hardcoded string.
Prompt auditing. For regulated industries, you may need to demonstrate which prompt was active when a specific output was generated. This is trivially easy with proper versioning. It is nearly impossible without it.
Regression detection. If your evals run on every prompt change — the same way your unit tests run on every code change — you catch regressions before they ship. This requires the prompt change to be a discrete, observable event. String literal edits buried in code are not observable events.
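With prompts in their own directory, wiring this into CI is a path filter. A sketch in GitHub Actions syntax — the workflow name and `run_evals.py` entry point are hypothetical, stand-ins for your own eval harness:

```yaml
# .github/workflows/prompt-evals.yaml — hypothetical eval gate
on:
  pull_request:
    paths:
      - "prompts/**"
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python run_evals.py --prompts-dir prompts
```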
The objection
“This is overengineering. We only change prompts occasionally.”
You change prompts more often than you think. Every time someone tweaks the system prompt “real quick” to fix an edge case, that is a prompt change. Every time someone adds a clarifying sentence because a user reported a bad answer, that is a prompt change. These changes are invisible precisely because the prompts are not tracked.
The teams that tell us they “rarely change prompts” are the same teams that cannot explain when their last prompt change was. They are not changing prompts rarely. They are changing prompts without noticing.
The heuristic
If you cannot tell me which prompt version was running in production at any arbitrary point in the past, you do not have prompt management — you have prompt chaos. The fix takes a day: move prompts to files, log the version on each call, record the active version at deploy time. You already have git and a logging system. Use them.
tl;dr
The pattern. Prompts get changed as buried string literals in unrelated PRs, so when behavior quietly degrades — longer responses, missed intent, subtle regressions — the team cannot trace the change that caused it or roll it back without a full deploy. The fix. Move prompts to dedicated version-controlled files, record the active prompt version in every LLM call’s logs, and capture which version deployed at each release. The outcome. Regressions become debuggable in minutes instead of hours, rollbacks take seconds, and A/B testing and prompt auditing become possible because prompt changes are finally observable events.