№ 27 ops May 09, 2025 · 9 min read

Fine-tuning is maintenance, not a one-time cost

The fine-tuning run is the easy part. The hard part is the data pipeline, the evaluation cadence, the retraining schedule, and the deployment workflow that follows.


The fine-tuning run is the easy part. You curate a dataset, configure a training job, wait for it to finish, deploy the model. A senior engineer can do this in a day. The hard part is everything that comes after — and “after” is where most teams get stuck.

The one-and-done fallacy

Teams treat fine-tuning like a deployment. Train it, ship it, move on to the next thing. This works for about 90 days. Then one of the following happens:

The base model gets a major update. Your fine-tune was built on GPT-4o-2024-08-06. The provider ships a new version. Your fine-tune is now pinned to the old model. You can keep using it, but you are missing out on improvements — and eventually the old version gets deprecated.

Your training data goes stale. The product changed, the terminology shifted, new features were added, old workflows were removed. The fine-tuned model confidently describes a UI that no longer exists. Users notice.

Distribution shift happens. The queries your users send in month 6 look different from the queries they sent in month 1. The model was trained on month 1 queries. Its performance degrades gradually — not catastrophically, just enough that users start saying “it used to be better.”

These are not edge cases. They are the normal lifecycle of a fine-tuned model. If you are not planning for them, you are planning to be surprised.

What maintenance actually looks like

A production fine-tuned model needs five operational components. Not all at once — you can build them incrementally — but eventually you need all five.

1. A training data pipeline. Not a one-time CSV export. A pipeline that continuously collects, cleans, and formats new training examples. The best source is usually production traffic — real user queries paired with good responses, reviewed by a human. This is boring work. It is also the most important work, because the quality of your training data is the ceiling on your model’s performance.

The pipeline does not need to be fancy. A script that pulls flagged interactions from your production logs, formats them into the training schema, and appends them to a versioned dataset is enough. Run it weekly. Review the output manually. Remove the garbage.
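A minimal sketch of that weekly script, assuming a hypothetical log schema where flagged interactions carry a `query`, a human-approved response, and a `reviewed` flag (the field names and chat-style training format are illustrative, not any specific provider's schema):

```python
import json
from datetime import date
from pathlib import Path

def format_example(interaction):
    """Convert one flagged interaction into a chat-format training example.
    Field names are illustrative -- adapt to your own log schema."""
    return {
        "messages": [
            {"role": "user", "content": interaction["query"]},
            {"role": "assistant", "content": interaction["approved_response"]},
        ]
    }

def append_to_dataset(interactions, dataset_dir="datasets"):
    """Append human-reviewed interactions to this week's JSONL file."""
    out = Path(dataset_dir) / f"train-{date.today().isoformat()}.jsonl"
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("a") as f:
        for it in interactions:
            if it.get("reviewed"):  # skip anything a human has not approved
                f.write(json.dumps(format_example(it)) + "\n")
    return out
```

The human review gate is the point: the script only moves approved examples into the dataset, so the manual cleanup step stays in the loop.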

2. A versioned dataset. Every training run should be traceable to a specific version of the training data. When your model starts producing bad outputs — and it will — you need to diff the training data between the last good version and the current one. Without versioning, debugging is guesswork.

Git works for small datasets. DVC works for large ones. The tool matters less than the discipline.
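Even without a dedicated tool, the discipline can be as simple as recording a content hash alongside every training run. A sketch, with an illustrative manifest format (not any particular tool's schema):

```python
import hashlib
import json
from pathlib import Path

def dataset_version(path):
    """Content hash of a dataset file -- a stable version identifier.
    Any byte-level change to the data produces a new version."""
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    return digest[:12]

def record_run(manifest_path, dataset_path, model_id):
    """Append a training-run record linking a model to the exact
    dataset version it was trained on."""
    entry = {
        "model": model_id,
        "dataset": str(dataset_path),
        "dataset_version": dataset_version(dataset_path),
    }
    with open(manifest_path, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

When a model regresses, the manifest tells you which dataset version to diff against.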

3. An evaluation suite. A set of test cases that measure the model’s performance on the things you care about. Not perplexity — task-specific metrics. If your model classifies support tickets, measure classification accuracy on a held-out set. If it generates code, measure pass rates on a curated set of problems. If it writes customer emails, have a human score a sample every week.

The eval suite is your early warning system. Run it after every training run, and run it on a schedule against your production model even when you have not retrained. If the scores drop, something changed — the data, the queries, or the base model.
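For the classification case, the eval harness can be very small. A sketch, where `predict` stands in for whatever calls your model:

```python
def run_eval(predict, cases):
    """Score a model on task-specific held-out cases, not perplexity.
    `cases` is a list of (input, expected_label) pairs; `predict` is
    any callable that maps an input to a label."""
    results = []
    for query, expected in cases:
        got = predict(query)
        results.append({
            "input": query,
            "expected": expected,
            "got": got,
            "pass": got == expected,
        })
    accuracy = sum(r["pass"] for r in results) / len(results)
    return accuracy, results
```

Returning per-case results, not just the aggregate score, matters: when the number drops, the failing cases tell you what changed.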

4. Retraining triggers. When do you retrain? Two approaches, and you should use both.

Time-based: retrain on a fixed schedule. Monthly is a reasonable starting point for most use cases. Quarterly if the domain is stable. Weekly if the domain changes fast — financial data, news, trending topics.

Metric-based: retrain when your eval scores drop below a threshold. This requires the eval suite from step 3. Set an alert. When accuracy drops below 85% — or whatever your threshold is — trigger a retraining run.

The time-based trigger catches gradual drift. The metric-based trigger catches sudden degradation. You need both.
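Combining the two triggers is a few lines. A sketch, with the 30-day cadence and 85% threshold from above as illustrative defaults:

```python
from datetime import date, timedelta

def should_retrain(last_trained, today, eval_score,
                   max_age_days=30, min_score=0.85):
    """Combine time-based and metric-based retraining triggers.
    Fires if the model is older than max_age_days (gradual drift)
    OR its eval score fell below min_score (sudden degradation)."""
    stale = (today - last_trained) > timedelta(days=max_age_days)
    degraded = eval_score < min_score
    return stale or degraded
```

Run it on a schedule against the eval suite's latest score, and wire the result to an alert or a training job.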

5. A deployment workflow. You trained a new version. How do you ship it? Not by swapping the model endpoint and hoping for the best.

The minimum viable deployment workflow: train the new model, run the eval suite against it, compare scores to the production model, deploy to a shadow environment (same traffic, no user-facing output), compare shadow outputs to production outputs, promote to production if the metrics are better.

If you want to be more rigorous — and you should, if the model serves paying customers — add an A/B test. Route 10% of traffic to the new model for a week. Measure user satisfaction, error rates, and task completion. Promote or roll back based on the data.
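The promotion gate at the end of that workflow can be expressed directly. A sketch, where the metric names and the 10% shadow-mismatch tolerance are illustrative assumptions:

```python
def promotion_decision(prod_scores, candidate_scores,
                       shadow_mismatch_rate, max_mismatch=0.10):
    """Promote only if the candidate matches or beats production on
    every eval metric AND its shadow-traffic outputs did not diverge
    too far from what production actually served."""
    beats_prod = all(
        candidate_scores[metric] >= prod_scores[metric]
        for metric in prod_scores
    )
    return beats_prod and shadow_mismatch_rate <= max_mismatch
```

Requiring the candidate to win on every metric, not just an average, is deliberately conservative: a model that trades accuracy for fluency should not promote silently.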

The decision framework

Here is the question we ask teams before they start a fine-tuning project: “Can you commit to maintaining this model for 12 months?”

Not the initial training run. The pipeline, the evals, the retraining schedule, the deployment workflow. All of it. For a year.

If the answer is yes, fine-tuning is probably the right call. You will get better performance than prompt engineering, and you will be able to maintain that performance over time.

If the answer is no — and it is often no, because the team is small, or the use case is not important enough to justify the operational overhead — stick with prompt engineering. A well-crafted prompt with few-shot examples gets you 80% of the performance of a fine-tuned model with 20% of the operational burden.

There is no shame in prompt engineering. There is significant risk in fine-tuning without the infrastructure to maintain it.

The hidden cost

The thing nobody mentions in the fine-tuning tutorials: the operational cost of maintaining a fine-tuned model is typically 3-5x the cost of the initial training run, annualized. The training run is a GPU bill. The maintenance is a people bill — engineers reviewing training data, running evals, debugging regressions, managing deployments.

Budget for this upfront or do not fine-tune at all.

The heuristic

Fine-tuning is not a project. It is a commitment. If you cannot build and staff the five operational components — data pipeline, versioned dataset, eval suite, retraining triggers, deployment workflow — then prompt engineering is the better choice. The initial performance gap is smaller than you think. The maintenance gap is larger than you think.

tl;dr

The pattern. Teams treat a fine-tuning run as a one-time deployment, then get caught off guard when the base model updates, training data goes stale, and query distribution drifts — all within 90 days.

The fix. Before starting a fine-tuning project, confirm you can staff and maintain the five operational components — data pipeline, versioned dataset, eval suite, retraining triggers, and deployment workflow — for at least 12 months.

The outcome. Teams that plan for the full maintenance burden ship fine-tuned models that stay accurate over time; teams that don’t end up with degraded models nobody wants to own.

