№ 21 · org · Feb 07, 2025 · 11 min read

Your AI engineer is doing three jobs

Prompt engineering, data engineering, and ML engineering are three different skill sets. Your single “AI engineer” is doing all three, badly. Split the role or accept the tradeoffs.


We keep meeting the same person. Different company, different title, same situation. They were hired as an “AI engineer.” They are writing prompts, building data pipelines, deploying models, setting up evals, managing GPU infrastructure, and fielding Slack messages from product managers who want to know why the chatbot said something weird yesterday.

They are doing three jobs. They are good at one of them. They are adequate at another. The third one is held together with duct tape and optimism. They are tired.

The three jobs

The “AI engineer” title has become a catch-all. When you unpack what the role actually requires, it splits into at least three distinct skill sets:

Prompt engineering and evaluation. This is the application-layer work. Writing prompts, iterating on them, building eval suites, analyzing failure modes, tuning for specific use cases. It is close to product work. The best prompt engineers think like product managers — they obsess over user intent, edge cases, and the gap between what the user asked and what the model understood.

Data engineering. This is the pipeline work. Ingesting documents, chunking them, building embeddings, maintaining vector stores, keeping data fresh, handling deduplication, managing metadata. It is unglamorous and critically important. Bad data pipelines produce bad retrieval, and bad retrieval produces bad answers — regardless of how good your prompt is.

ML infrastructure and operations. This is the deployment work. Serving models, managing GPU instances, optimizing latency, monitoring for quality regressions, handling failover, managing model versions. It is classic ops work adapted for a new stack. The skills transfer from traditional DevOps, but the specifics — quantization, batching strategies, KV cache management — are domain-specific.

These three jobs require different backgrounds, different tools, and different ways of thinking. Prompt engineering is iterative and experimental. Data engineering is methodical and plumbing-intensive. ML ops is reliability-focused and systems-oriented.
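The data-engineering half is the easiest of the three to underestimate, so a concrete sketch helps. Below is a toy ingestion pipeline in Python. The names (`chunk_text`, `embed`, `ingest`) are illustrative, not any particular library, and the hash-based `embed` is a stand-in for a real embedding-model call:

```python
import hashlib

def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

def embed(chunk: str) -> list[float]:
    """Placeholder embedding: a real pipeline calls an embedding model here."""
    digest = hashlib.sha256(chunk.encode()).digest()
    return [b / 255 for b in digest[:8]]

def ingest(doc_id: str, text: str, store: dict) -> int:
    """Chunk a document, embed each chunk, and upsert into a toy vector store."""
    chunks = chunk_text(text)
    for i, chunk in enumerate(chunks):
        store[f"{doc_id}:{i}"] = {"text": chunk, "vector": embed(chunk)}
    return len(chunks)

store: dict = {}
n = ingest("handbook", "word " * 120, store)
print(n, len(store))  # → 4 4
```

Even in this toy form, the design choices that rot in production are visible: the chunk size and overlap were "chosen once" as defaults, and nothing here detects when the source format changes.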

Where the breakdown happens

In theory, one strong generalist can handle all three. In practice, every person has a strongest skill and a weakest skill. The weakest skill becomes the bottleneck for the entire system.

Here is the pattern we see most often:

The AI engineer was hired for their ML background. They are good at model selection, evaluation, and prompt engineering. They can build a solid eval suite and iterate on prompts effectively. They are adequate at deployment — they can get a model running in production, even if the infrastructure is not optimally configured.

But the data engineering is where things fall apart. The ingestion pipeline is a series of scripts that run on someone’s laptop. The chunking strategy was chosen once and never revisited. There is no monitoring on data freshness. When the source documents change format, the pipeline breaks silently and nobody notices until a user reports bad answers three weeks later.

The team looks at the bad answers and assumes it is a model problem. They spend two weeks tuning prompts. The answers do not improve, because the problem is not the prompt — it is the data. But the data pipeline is the thing nobody is paying attention to, because the person responsible for it is also responsible for everything else and does not have time to instrument it properly.

We have seen this exact failure mode at least a dozen times in the last year. The details vary. The pattern does not.

The second most common failure

The other common version: the AI engineer is strong on data and prompts but weak on ops. The system works beautifully in development. The eval numbers are great. The demos are impressive.

Then it hits production traffic. Latency spikes. The autoscaling does not work because nobody configured it properly. The monitoring is basic — just uptime checks, no quality metrics. When the model starts producing worse outputs because a dependency changed, nobody notices for days.
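What "quality metrics, not just uptime checks" can mean in practice: a small sketch that samples production answers and scores them against expected keywords, alerting when the mean drops below a baseline. The scoring is deliberately crude, and the names (`score_answer`, `quality_check`) are hypothetical:

```python
import statistics

def score_answer(answer: str, expected_keywords: list[str]) -> float:
    """Crude quality score: fraction of expected keywords present in the answer."""
    hits = sum(1 for kw in expected_keywords if kw.lower() in answer.lower())
    return hits / len(expected_keywords)

def quality_check(samples, baseline: float, tolerance: float = 0.1) -> bool:
    """Return True (alert) if mean quality drops more than `tolerance` below baseline."""
    mean = statistics.mean(score_answer(a, kws) for a, kws in samples)
    return mean < baseline - tolerance

# Sampled production answers paired with the keywords a good answer should contain.
samples = [
    ("Refunds are processed within 5 business days.", ["refund", "5 business days"]),
    ("I cannot help with that.", ["refund", "5 business days"]),
]
print(quality_check(samples, baseline=0.9))  # → True
```

A real system would pull the samples from production logs on a schedule and page someone when the check fires; the point is that this is a few dozen lines, not a quarter-long project.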

The AI engineer knows the system needs better ops. They just do not have time to build it, because they are also maintaining the data pipeline and iterating on prompts for the next feature.

Why this happens

The root cause is organizational. Most companies hired their first AI engineer 12-18 months ago. That person was expected to build the whole stack — prototype to production. For a prototype, one person is fine. Prototypes do not need robust data pipelines or production-grade ops.

But prototypes become products. The scope grows. The traffic grows. The stakeholders multiply. And the team does not grow with it. The single AI engineer who built the prototype is now operating a production system alone, and the org has not noticed that the role has outgrown one person.

Part of the problem is that the hiring market uses “AI engineer” as a single role. Job postings list requirements that span all three skill sets — as if finding someone who is equally strong at prompt engineering, data engineering, and ML ops is a reasonable expectation. It is like posting a job for someone who is equally strong at frontend, backend, and infrastructure. That person exists, but they are rare, expensive, and probably already running their own company.

The fix

You have two options. Both are legitimate. Pick the one that matches your stage and budget.

Option 1: Split the role. If you can hire, the highest-leverage split is separating the application layer (prompts, evals, product integration) from the infrastructure layer (data pipelines, model serving, monitoring). These two halves have different cadences — the application layer changes daily, the infrastructure layer should change infrequently but must be reliable when it does.

You do not necessarily need to hire a third person immediately. A strong data engineer from your existing team can often take on the data pipeline work if given context. A strong DevOps engineer can take on model serving if given some upskilling. The ML-specific knowledge is thinner than people think — the operational patterns are familiar.

Option 2: Accept the tradeoff explicitly. If you cannot hire, decide which of the three areas will be weak and manage accordingly. This is not admitting defeat. This is being honest about constraints.

If ops will be weak, invest in managed services that reduce the ops burden. Use hosted model APIs instead of self-hosting. Use managed vector databases instead of running your own. Trade cost for reduced operational complexity.

If data engineering will be weak, invest in monitoring that catches data quality issues early. Instrument your pipeline with freshness checks, schema validation, and output sampling. You cannot fix what you cannot see.
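A freshness check and a schema check are each a few lines. This sketch assumes a record-per-document pipeline; the field names and thresholds are illustrative, not prescriptive:

```python
import time

def check_freshness(last_ingest_ts: float, max_age_hours: float = 24) -> list[str]:
    """Flag the pipeline if the newest ingested document is older than the threshold."""
    age_hours = (time.time() - last_ingest_ts) / 3600
    return [f"stale data: last ingest {age_hours:.1f}h ago"] if age_hours > max_age_hours else []

def check_schema(record: dict, required: dict) -> list[str]:
    """Validate that a source record still has the fields the chunker expects."""
    problems = []
    for field, expected_type in required.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for {field}: {type(record[field]).__name__}")
    return problems

REQUIRED = {"id": str, "body": str, "updated_at": float}

# A source record whose format silently changed: body became a number.
record = {"id": "doc-1", "body": 42, "updated_at": 1700000000.0}
alerts = check_schema(record, REQUIRED) + check_freshness(time.time() - 3600)
print(alerts)  # → ['wrong type for body: int']
```

This is exactly the instrumentation that would have caught the "source documents changed format" failure weeks before a user reported bad answers.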

If prompt engineering will be weak, invest in a strong eval suite and iterate more slowly. Fewer prompt changes, more thoroughly tested. Ship less frequently but with higher confidence.
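A minimal eval harness can be this small. `fake_generate` stands in for your actual model call, and the pass criterion (substring match) is the simplest possible; real suites usually layer on graded rubrics or model-based scoring:

```python
def run_evals(generate, cases) -> dict:
    """Run a prompt variant against golden cases; return pass rate and failures."""
    failures = []
    for prompt, must_contain in cases:
        output = generate(prompt)
        if must_contain.lower() not in output.lower():
            failures.append((prompt, output))
    total = len(cases)
    return {"pass_rate": (total - len(failures)) / total, "failures": failures}

# Stand-in for a model call; a real suite hits your LLM endpoint.
def fake_generate(prompt: str) -> str:
    return "Our return window is 30 days." if "return" in prompt else "I don't know."

cases = [
    ("What is the return window?", "30 days"),
    ("How do I reset my password?", "reset link"),
]
result = run_evals(fake_generate, cases)
print(result["pass_rate"])  # → 0.5
```

Run this against every prompt change before it ships, and "fewer changes, more thoroughly tested" becomes a gate rather than a hope.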

The worst option is not choosing — letting all three areas be mediocre without a conscious decision about which one matters least.

The conversation to have

If you are a leader with an AI engineer on your team, ask them this: “Which of these three areas — prompts and evals, data pipelines, or deployment and ops — do you spend the least time on, and does that worry you?”

Their answer will tell you where your risk is. If they say “data pipelines” and they have a RAG system, you have a problem. If they say “ops” and you are running in production, you have a problem. If they say “prompts” and accuracy is slipping, you have a problem.

The heuristic: if one person is responsible for prompts, data, and infrastructure, identify which of the three is their weakest skill. That is where your next production incident will come from. Either shore it up with a hire, reduce the scope with managed services, or add monitoring so you see the failure before your users do.

tl;dr

The pattern. The single “AI engineer” is simultaneously responsible for prompt engineering, data pipelines, and ML ops — three different skill sets with different cadences — so the weakest one silently degrades until a production incident makes it visible.

The fix. Either split the role at the application-versus-infrastructure boundary or explicitly decide which area will be weak and compensate with managed services, monitoring, or slower iteration cycles.

The outcome. The team stops diagnosing prompt problems that are actually data problems, and the next production incident gets caught by instrumentation instead of a user complaint.
