№ 10 · org · Aug 23, 2024 · 11 min read

Hire the infra engineer before the ML engineer

Your first AI hire should not be someone who trains models. It should be someone who can deploy them, monitor them, and wake up when they break.


We have watched this play out at a dozen companies now. The order matters.

The hiring mistake

A company decides to invest in AI. They open a req for a “Senior ML Engineer.” The job description mentions model training, fine-tuning, feature engineering, and maybe some research. They hire someone good — someone with a strong background in machine learning, papers on their resume, experience with PyTorch.

That person arrives on day one and asks reasonable questions. Where is the GPU cluster? How do we deploy models to production? What is the CI/CD pipeline for model artifacts? Where do experiment metrics get logged? What is the monitoring setup?

The answers are: we don’t have one, we haven’t figured that out yet, there isn’t one, nowhere, and there isn’t one.

So the ML engineer — the person you hired to improve models — spends their first six months writing Dockerfiles, setting up a model registry, building a deployment pipeline, configuring monitoring, and arguing with the platform team about Kubernetes resource limits.

This is a waste. Not because the work is unimportant — it is critical. But because you hired someone whose expertise is in modeling and asked them to do infrastructure. They will do it adequately. An infrastructure engineer would do it well.

What the ML engineer actually needs

An ML engineer is productive when the following things exist:

A way to deploy a model. Not “push a Docker image and open a PR to update the Kubernetes manifest.” A pipeline. Code goes in, an API endpoint comes out. Canary deployment. Rollback. The ML engineer should not have to think about load balancers.

A way to monitor a model. Request latency, error rates, input/output distributions, drift detection. Not just application-level monitoring — model-level monitoring. Is the distribution of predictions changing? Are confidence scores dropping? This is specialized infrastructure, but it is infrastructure.
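One way to sketch that model-level check is a population stability index (PSI) over prediction or confidence-score distributions — a common drift statistic, though the thresholds and synthetic data below are illustrative assumptions, not anything prescribed here:

```python
import numpy as np

def psi(baseline, current, bins=10):
    """Population Stability Index between two score distributions.

    Rule of thumb (an assumption, not gospel): < 0.1 stable,
    0.1-0.25 drifting, > 0.25 significant drift.
    """
    edges = np.histogram_bin_edges(baseline, bins=bins)
    b_pct = np.histogram(baseline, bins=edges)[0] / len(baseline)
    c_pct = np.histogram(current, bins=edges)[0] / len(current)
    # Clip empty buckets so we never divide by zero or take log(0).
    b_pct = np.clip(b_pct, 1e-6, None)
    c_pct = np.clip(c_pct, 1e-6, None)
    return float(np.sum((c_pct - b_pct) * np.log(c_pct / b_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(0.7, 0.1, 10_000)  # last week's confidence scores
current = rng.normal(0.55, 0.1, 10_000)  # today's: the mean has slipped

print(psi(baseline, baseline))  # identical distribution: ~0, no alert
print(psi(baseline, current))   # shifted distribution: well above 0.25
```

A job that runs this comparison hourly and pages when the index crosses a threshold is exactly the kind of thing the infra hire builds without being asked.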

A way to run experiments. A/B testing or shadow mode for new models. The ability to route a percentage of traffic to a new version and compare metrics. Without this, every model change is a yolo deploy.
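The traffic-splitting piece can be as small as a deterministic hash router — a sketch, with illustrative names, of the minimum needed to send a fixed percentage of users to a canary model:

```python
import hashlib

def route(request_id: str, canary_pct: float = 5.0) -> str:
    """Deterministically route a request to 'canary' or 'stable'.

    Hashing the user/request id (rather than sampling randomly) pins a
    given user to one model version across requests, so their metrics
    stay comparable between versions.
    """
    h = int(hashlib.sha256(request_id.encode()).hexdigest(), 16)
    bucket = h % 10_000 / 100  # 0.00 .. 99.99
    return "canary" if bucket < canary_pct else "stable"

# The same id always lands in the same bucket:
assert route("user-42") == route("user-42")

counts = {"canary": 0, "stable": 0}
for i in range(10_000):
    counts[route(f"user-{i}")] += 1
print(counts)  # roughly 5% canary
```

Shadow mode is the same idea with one change: the canary model gets a copy of the request and its output is logged, not returned.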

A way to log and query predictions. Every prediction should be logged with its input, output, latency, and model version. This data is how the ML engineer diagnoses problems and measures improvements. Without it, they are guessing.
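Concretely, that means one structured record per prediction. The field names below are illustrative, and in production the line would ship to a queryable store rather than stdout — the point is that input, output, latency, and model version travel together so they can be joined later:

```python
import json
import time
import uuid

def log_prediction(model_version, features, prediction, latency_ms):
    """Emit one structured log line per prediction."""
    record = {
        "prediction_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model_version": model_version,
        "features": features,
        "prediction": prediction,
        "latency_ms": latency_ms,
    }
    print(json.dumps(record))  # in production: a log pipeline, not stdout
    return record

rec = log_prediction("fraud-v3", {"amount": 129.9}, {"score": 0.91}, 14.2)
```

With records shaped like this, "did v3 get slower for large transactions?" is a query, not an investigation.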

A way to manage training data. Versioned datasets, labeling pipelines, data quality checks. The ML engineer should be improving the model with better data and better architectures — not building the data pipeline from scratch.
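Dataset versioning can start as something very simple: content-address the data so every training run records exactly which bytes produced the model. A toy stand-in for a real tool like DVC, purely to show the idea:

```python
import hashlib

def dataset_version(data: bytes) -> str:
    """Content-addressed dataset version: the hash of the bytes.

    Any change to the data yields a new version id, so a model can be
    traced back to the exact dataset it was trained on.
    """
    return hashlib.sha256(data).hexdigest()[:12]

v1 = dataset_version(b"label,text\n1,good\n")
v2 = dataset_version(b"label,text\n1,good\n0,bad\n")
print(v1 != v2)  # adding a row produces a different version id
```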

None of these are ML problems. They are infrastructure problems. They require someone who thinks in terms of systems, pipelines, reliability, and operational excellence. Someone who has built and maintained production services. Someone who knows what a pager feels like.

The right first hire

The right first AI hire is a senior backend or infrastructure engineer who is curious about ML. Not an ML engineer who can tolerate infrastructure. The distinction matters.

This person has built production services before. They know how to set up CI/CD, monitoring, alerting. They can design a data pipeline. They can stand up a model serving layer — whether that is a FastAPI wrapper, a managed service like SageMaker endpoints, or a simple API gateway in front of an LLM provider. They understand operational concerns: what happens at 3am when the model serving pod OOMs? What happens when the upstream data source changes its schema?

They do not need to know how to train models. They need to know how to deploy, monitor, and operate them. They need to be curious enough about ML to understand the domain — to know why drift detection matters, why you cannot just A/B test a model like a button color, why latency percentiles matter more than averages for inference workloads.
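The percentiles-over-averages point is easy to show with a toy latency sample — mostly-fast requests plus a slow tail, the shape a cold cache or an oversized prompt produces (numbers invented for illustration):

```python
import statistics

# 98 fast requests and 2 pathological ones.
latencies_ms = [20] * 98 + [900] * 2

mean = statistics.mean(latencies_ms)
p50 = statistics.quantiles(latencies_ms, n=100)[49]
p99 = statistics.quantiles(latencies_ms, n=100)[98]

print(mean)  # 37.6 ms -- the average looks fine
print(p50)   # 20 ms
print(p99)   # 900 ms -- 1 in 100 users waits nearly a second
```

An alert on mean latency never fires here; an alert on p99 does. That is the kind of instinct the first hire brings.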

This person builds the platform. When the ML engineer arrives — hire number two or three — they walk into a functioning environment. They can focus on what they are actually good at: improving the models. Their first week is running experiments, not writing Terraform.

The compound effect

The order creates a compound effect. When the ML engineer is productive from day one, you get model improvements faster. Those improvements produce results. Results justify more investment. More investment means more hires. The next hires — whether ML engineers, data engineers, or applied scientists — all benefit from the platform that hire number one built.

When you do it in the other order, you get the opposite. The ML engineer spends months on infrastructure. The infrastructure is adequate but fragile — built by someone whose heart is in modeling, not operations. When the second hire arrives, they inherit infrastructure that needs to be rebuilt. The compound effect runs in reverse.

We have seen this pattern at companies ranging from 50-person startups to 500-person mid-market companies. The ones that hired infra first shipped their first AI feature in 2-3 months. The ones that hired ML first shipped in 6-9 months — and then spent another 3 months stabilizing the infrastructure.

The objection

The objection we hear most often is: “But we need someone who understands ML to make architectural decisions. What if the infra engineer builds the wrong thing?”

This is a valid concern with a straightforward answer. The infra engineer does not need to work in a vacuum. You can get ML architecture guidance from a consultant, an advisor, or even a part-time hire. What you cannot easily outsource is the day-to-day work of building and maintaining production infrastructure. That requires someone embedded in the team, full-time, who owns the system.

The other objection: “We want to start with fine-tuning / training a custom model.” If this is genuinely your starting point — not an API-based AI feature but a custom model — then yes, you need ML expertise first. But most companies in 2024 are not training models. They are using APIs. They are building applications on top of foundation models. For this work, the infrastructure is the bottleneck, not the modeling.

The job description

If you are writing the req for your first AI hire, here is what it should look like:

  • Senior backend/infrastructure engineer with production experience
  • Has built and operated services at scale (you get to define what “scale” means for your context)
  • Familiar with ML concepts — does not need to train models but should understand the lifecycle
  • Comfortable with model serving infrastructure (Ray, TorchServe, Triton, or even just FastAPI)
  • Has opinions about monitoring and observability
  • Willing to carry a pager for the AI system

Notice what is absent: no mention of papers, no mention of research, no mention of model architectures. Those matter — for hire number two.

The takeaway

Your first AI hire builds the stage. Your second AI hire performs on it.

The heuristic: if your ML engineer is writing Dockerfiles, you hired in the wrong order.

tl;dr

The pattern. Companies open their first AI req for a Senior ML Engineer, who arrives to find no deployment pipeline, no monitoring, and no experiment infrastructure, and spends six months doing infrastructure work they were never hired to do.

The fix. Make your first AI hire a senior backend or infrastructure engineer who understands ML concepts and can build the deployment, monitoring, and experiment platform that the ML engineer will actually need to be productive.

The outcome. The ML engineer you hire second walks into a functioning environment, ships model improvements in their first week instead of their sixth month, and the entire AI investment compounds faster because the stage was built before the performer arrived.

