№ 05 ops Jun 14, 2024 · 8 min read

Your model is not your moat

The model is a commodity. The moat is the data pipeline, the eval suite, the deployment infrastructure, and the feedback loop. Most teams invest in the wrong layer.


In the first half of 2024, we watched teams agonize over model selection. Weeks of evaluation. Benchmark comparisons. Internal bake-offs. Spreadsheets with weighted scoring rubrics. The decision felt momentous — like choosing a database or a cloud provider. A decision you’d live with for years.

It wasn’t. Most of those teams switched models within 6 months. Some switched twice. The models got cheaper, or faster, or a new one came out that was better for their specific use case. The decision that felt permanent was temporary.

The thing they actually lived with — the thing that was hard to change — was everything around the model.

The commodity layer

Models are commodities. Not yet in the economic sense — pricing varies, capabilities differ, there are real tradeoffs. But in the architectural sense. They are interchangeable components with a standard interface: text in, text out. Some are better at reasoning. Some are faster. Some are cheaper. The rankings shift every quarter.

If your architecture is clean, swapping models is a configuration change. If your architecture isn’t clean — if you’ve hardcoded model-specific prompt patterns, relied on undocumented behaviors, or built your system around a specific model’s quirks — swapping models is a rewrite.
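"Configuration change" can be taken literally. A minimal sketch of the decoupling, in Python, with a thin adapter in front of every model call. The provider functions and the `complete` interface here are hypothetical stand-ins, not any real SDK:

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Fake providers standing in for real SDK calls. Each is wrapped
# behind the same "text in, text out" interface.
def fake_gpt(prompt: str) -> str:
    return f"[gpt] {prompt}"

def fake_claude(prompt: str) -> str:
    return f"[claude] {prompt}"

PROVIDERS: Dict[str, Callable[[str], str]] = {
    "gpt-4": fake_gpt,
    "claude": fake_claude,
}

@dataclass
class ModelClient:
    """The model choice lives in config, not in call sites."""
    model: str

    def complete(self, prompt: str) -> str:
        return PROVIDERS[self.model](prompt)

# Swapping models is a one-line config change; no call site moves.
client = ModelClient(model="gpt-4")
print(client.complete("hello"))
client = ModelClient(model="claude")
print(client.complete("hello"))
```

The point is not the adapter itself but the discipline it enforces: model-specific prompt tricks can't leak into the rest of the codebase if every call goes through one door.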

The teams that treated model selection as the primary technical decision ended up coupling themselves to that decision. The teams that treated the model as a replaceable component ended up with systems that could adapt when the market shifted.

The layers that compound

The model doesn’t compound. It depreciates. Today’s best model is next quarter’s second-best model. But the infrastructure around it — done well — compounds.

The data pipeline. How do you get data into the system? How do you clean it, chunk it, embed it, index it? How do you handle updates? How do you deal with deletions? This is plumbing. It is unglamorous. It is the difference between a system that works on demo data and a system that works on production data. And it takes months to get right — not because it’s technically hard, but because production data is messy in ways you don’t discover until you’re in production.
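The update and deletion handling is where demo pipelines usually fall over. A toy sketch of the shape, assuming a content-hash check so re-ingesting unchanged documents is a no-op; the chunker and index here are deliberately naive placeholders for real splitting and embedding:

```python
import hashlib
from typing import Dict, List

def chunk(text: str, size: int = 200) -> List[str]:
    # Naive fixed-size chunking; production splitters respect
    # sentence and section boundaries.
    return [text[i:i + size] for i in range(0, len(text), size)]

class Index:
    """Toy index keyed by doc_id, with a content hash per document
    so updates replace stale chunks and unchanged docs are skipped."""

    def __init__(self) -> None:
        self.docs: Dict[str, str] = {}        # doc_id -> content hash
        self.chunks: Dict[str, List[str]] = {}

    def upsert(self, doc_id: str, text: str) -> bool:
        digest = hashlib.sha256(text.encode()).hexdigest()
        if self.docs.get(doc_id) == digest:
            return False                      # unchanged: skip re-embedding
        self.docs[doc_id] = digest
        self.chunks[doc_id] = chunk(text)     # embed + index would happen here
        return True

    def delete(self, doc_id: str) -> None:
        # Deletions must purge the index too, or retrieval keeps
        # surfacing documents that no longer exist.
        self.docs.pop(doc_id, None)
        self.chunks.pop(doc_id, None)
```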

The eval suite. How do you know if your system is working? Golden sets, regression tests, model-as-judge evaluations, A/B tests. Every test you write is an asset. Every eval you run generates signal. Over time, your eval suite becomes your institutional knowledge about what “good” means for your system. It’s the thing that lets you change the model, the prompt, the retrieval, the post-processing — and know whether the change helped or hurt.
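The model-agnostic property is what makes the suite an asset. A minimal sketch of a golden-set runner; the golden pairs and the substring check are illustrative assumptions, and real suites layer on model-as-judge scoring:

```python
from typing import Callable, Dict, List, Tuple

# A golden set: (input, expected substring) pairs, typically
# accumulated from real production traffic.
GOLDEN: List[Tuple[str, str]] = [
    ("capital of France", "Paris"),
    ("2 + 2", "4"),
]

def run_evals(system: Callable[[str], str],
              golden: List[Tuple[str, str]]) -> Dict[str, float]:
    """Score a system against the golden set. The same suite runs
    unchanged whether `system` wraps GPT-4, Claude, or a reworked
    retrieval pipeline."""
    passed = sum(expected in system(q) for q, expected in golden)
    return {"pass_rate": passed / len(golden), "total": float(len(golden))}

# A stand-in system for illustration:
def toy_system(q: str) -> str:
    answers = {"capital of France": "Paris", "2 + 2": "4"}
    return answers.get(q, "unknown")
```

Because `run_evals` only sees a callable, every change to the stack, model, prompt, retrieval, or post-processing, gets scored against the same definition of "good."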

The deployment infrastructure. How do you serve the model? How do you handle failures, retries, timeouts, rate limits? How do you do canary deployments? How do you roll back? This is standard infrastructure engineering applied to a new component. The teams that already had mature deployment practices adapted quickly. The teams that didn’t — the ones treating the AI feature as a special snowflake — built fragile systems that were painful to operate.
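The failure handling is standard enough to sketch. A minimal retry wrapper with exponential backoff, assuming transient failures are worth retrying; `base_delay` is zero here only so the example runs instantly, and a production value would be on the order of a second:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_retries(call: Callable[[], T],
                 attempts: int = 3,
                 base_delay: float = 0.0) -> T:
    """Retry transient failures with exponential backoff."""
    for i in range(attempts):
        try:
            return call()
        except Exception:
            if i == attempts - 1:
                raise                          # out of retries: surface the error
            time.sleep(base_delay * (2 ** i))  # 1x, 2x, 4x, ...
    raise RuntimeError("unreachable")

# A flaky call that fails twice, then succeeds:
calls = {"n": 0}
def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("transient")
    return "ok"

print(with_retries(flaky))  # prints "ok" after two retried failures
```

None of this is AI-specific, which is the point: it is the same retry, timeout, and rollback machinery any external dependency needs.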

The feedback loop. How do you learn from production? How do you capture user behavior, error patterns, edge cases? How do you turn that into improvements? The feedback loop is the meta-system — the thing that makes everything else get better over time. Without it, you’re flying blind. With it, every day in production makes your system a little better.

The wrong optimization

The teams that optimized for model quality spent their time on prompt engineering, model evaluation, and benchmark comparison. These are useful activities. But they have diminishing returns and no compounding effect. The perfect prompt for GPT-4 is useless when you switch to Claude. The benchmark comparison is stale in a month.

The teams that optimized for operational quality spent their time on pipelines, evals, deployment, and feedback. These are boring activities. But they compound. The eval suite you build for GPT-4 works for Claude. The deployment infrastructure you build for one model serves the next. The feedback loop you establish gets richer every week.

There’s a useful analogy to web development in the 2000s. Early on, teams agonized over which web framework to use — the choice felt permanent. Over time, the framework became a replaceable component. What mattered — what compounded — was the deployment pipeline, the test suite, the monitoring, the team’s operational muscle. The teams that invested in those things could switch frameworks without rewriting their business logic.

We’re at the same inflection point with AI. The model is the framework. It matters, but it’s not the moat.

How to tell where you are

Here’s a quick diagnostic. Answer these questions about your AI system:

Can you swap models in under a day? If not, you’re coupled to your model. Decouple.

Can you tell, within an hour of deploying a change, whether the system got better or worse? If not, you don’t have evals. Build them.

Can you roll back a bad deployment in under 5 minutes? If not, you don’t have deployment infrastructure. Standard stuff — build it.

Can you point to a specific improvement that came from production feedback in the last month? If not, you don’t have a feedback loop. Start one.

If you answered “no” to more than one of these, you’re investing in the wrong layer. You’re polishing the model while the operational foundation rusts.

The heuristic

Spend 20% of your time on model selection and prompt engineering. Spend 80% on everything else — the pipeline, the evals, the deployment, the feedback loop. The model is what you ship today. The infrastructure is what lets you ship better tomorrow.

When you catch yourself debating which model to use, ask: does it matter? If the operational infrastructure is solid, you can try both and measure. If the infrastructure isn’t solid, it doesn’t matter which model you pick — you won’t be able to tell if it’s working anyway.
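"Try both and measure" can be a few lines once the evals exist. A self-contained sketch, with a hypothetical substring-match scorer standing in for a real eval suite:

```python
from typing import Callable, List, Tuple

def compare(model_a: Callable[[str], str],
            model_b: Callable[[str], str],
            golden: List[Tuple[str, str]]) -> str:
    """Run both candidates over the same golden set and report the
    winner. With solid evals, 'which model?' is an experiment, not
    a debate."""
    score = lambda m: sum(expected in m(q) for q, expected in golden)
    a, b = score(model_a), score(model_b)
    return "a" if a > b else "b" if b > a else "tie"
```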

tl;dr

The pattern. Teams spend weeks on model selection and prompt engineering — work that depreciates every quarter — while neglecting the data pipeline, eval suite, deployment infrastructure, and feedback loop that compound in value over time. The fix. Spend 20% of your effort on model selection and 80% on the operational layer: decouple your architecture so a model swap is a config change, build evals that survive model migrations, and establish a feedback loop that learns from every day in production. The outcome. When a better or cheaper model arrives — and it will — you can evaluate and switch in a day instead of facing a rewrite, and the rest of your infrastructure keeps getting better regardless of which model sits inside it.
