The GPU bill is not the expensive part
Your AI system's real cost is the engineer debugging a hallucination at 2am, the product manager re-explaining the limitations to sales, and the trust you lose with every wrong answer.
Every AI cost conversation we walk into starts with the cloud bill. How much are we spending on inference? Can we use a smaller model? Should we self-host? The GPU line item is visible, legible, and easy to optimize. It is also the smallest cost in most AI systems.
The expensive part is everything else.
The costs that don’t have line items
Engineering time on non-deterministic debugging. A traditional bug has a stack trace. You read it. You find the line. You fix it. An AI bug has a prompt, a model response, a retrieval result, and a user complaint that says “the answer was wrong.” There is no stack trace. The same input might produce a different output tomorrow. Your engineer spends four hours reproducing the issue, three hours tracing it to a retrieval problem, and two hours writing a targeted eval to make sure it doesn’t happen again. That is a full day for one bug report. At senior engineer rates, that day costs more than a week of inference.
Product management overhead. Your PM is in a meeting explaining to sales why the AI feature “sometimes gets things wrong.” Sales wants a guarantee. The PM cannot give one. This meeting happens every two weeks. It is never the same meeting twice because the failure modes keep changing. The PM starts building a spreadsheet of “known limitations” that gets longer every sprint. This is not product management. This is reputation management. It is expensive and it does not scale.
Support cost. A wrong answer from a traditional system is a bug — file a ticket, we’ll fix it. A wrong answer from an AI system is a trust event. The user does not think “there’s a bug.” The user thinks “this system doesn’t work.” Support has to triage whether the answer was actually wrong, whether it was wrong in a way that matters, and whether the user has lost confidence in the product. This is a fundamentally different support interaction. It takes longer. It requires more context. It often escalates.
Trust erosion. This is the most expensive cost and the hardest to measure. Every wrong answer costs you a fraction of user trust. Users who lose trust stop using the feature. Usage drops. The team looks at the metrics and concludes the feature is not valuable. They scale it back or kill it. The feature was valuable — the reliability was not. But by the time you realize this, the window has closed.
Why teams undercount these costs
The GPU bill arrives on the first of the month. It has a number on it. You can graph it. You can set alerts on it. You can optimize it.
Engineering time spent on AI debugging does not have its own line item. It is hidden inside sprint velocity. Your team shipped fewer features this quarter. Why? Partially because two engineers spent a cumulative three weeks investigating hallucination reports. This cost does not appear in any dashboard. It appears as “we’re a little behind on the roadmap.”
Product management overhead does not have a line item either. It appears as “the PM seems really busy” and “we need another PM.” You hire another PM. The cost is now $180k per year in salary, benefits, and overhead. Nobody connects this to AI reliability.
Support cost shows up as “ticket volume is higher than expected.” You hire another support engineer. Another $120k. Nobody connects this to the model that confidently told a customer their contract included a feature it did not.
The math nobody does
Here is a rough accounting we have done with a few teams. The numbers are anonymized but the ratios are real.
Monthly GPU and API costs: $8k. The number everyone talks about.
Monthly engineering time on AI-specific debugging and maintenance: $24k. Three engineers, each spending roughly 30% of their time on AI reliability work instead of feature development.
Monthly product management overhead for AI limitations and expectations: $6k. One PM spending roughly 40% of their time on AI-related stakeholder management.
Monthly support cost increment from AI-related tickets: $4k. Higher handle time, more escalations, more “is this right?” questions.
Quarterly trust recovery costs — re-engagement campaigns, user interviews, feature re-launches after reliability incidents: $15k per quarter, call it $5k per month.
Total monthly cost: $47k. The GPU bill is 17% of it.
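The arithmetic is simple enough to sketch. The dollar figures are the ones from the accounting above; the category names are ours:

```python
# Reproduce the rough monthly accounting from the text (numbers in $k).
costs = {
    "gpu_and_api": 8,
    "engineering_debugging": 24,
    "pm_overhead": 6,
    "support_increment": 4,
    "trust_recovery": 5,  # $15k per quarter, amortized monthly
}

total = sum(costs.values())
gpu_share = costs["gpu_and_api"] / total

print(f"Total monthly cost: ${total}k")        # $47k
print(f"GPU bill share:     {gpu_share:.0%}")  # 17%
```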
The fix is not cheaper GPUs
Optimizing the GPU bill is fine. Use smaller models where they work. Cache common queries. Batch requests. These are good engineering practices. They save real money.
But they do not touch the 83% of cost that lives outside the cloud bill. To reduce those costs, you need reliability.
Evals. An eval suite that catches regressions before they reach users. This directly reduces engineering debugging time and support ticket volume. A good eval suite pays for itself within a month — not in GPU savings, but in engineering time recovered.
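A regression eval can start as something this small. The `ask` function below is a stand-in for your real model or RAG pipeline, and the case format is illustrative, not a framework:

```python
# Minimal regression-eval sketch. `ask` is a stub standing in for the
# real model call; replace it with your actual pipeline.
def ask(question: str) -> str:
    canned = {"What plans include SSO?": "SSO is available on the Enterprise plan."}
    return canned.get(question, "I don't know.")

# Each case pins a fact the answer must contain.
EVAL_CASES = [
    {"question": "What plans include SSO?", "must_contain": "Enterprise"},
]

def run_evals() -> list[str]:
    """Return the questions whose answers failed their check."""
    failures = []
    for case in EVAL_CASES:
        answer = ask(case["question"])
        if case["must_contain"].lower() not in answer.lower():
            failures.append(case["question"])
    return failures

if __name__ == "__main__":
    failed = run_evals()
    print(f"{len(EVAL_CASES) - len(failed)}/{len(EVAL_CASES)} passed")
```

Run this in CI on every prompt or retrieval change. The point is not sophistication; it is that a regression gets caught by an assertion instead of a support ticket.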
Monitoring. Not just uptime monitoring. Output monitoring. Track answer confidence, retrieval quality, and user feedback signals in production. When something starts degrading, you find out from your dashboard, not from your support queue.
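One sketch of output monitoring, assuming your pipeline already emits a per-answer quality signal (a retrieval score, a thumbs up/down, a verifier score). The window size and threshold are illustrative:

```python
from collections import deque

class OutputMonitor:
    """Track a rolling window of per-answer quality signals and flag
    degradation when the rolling average drops below a baseline."""

    def __init__(self, window: int = 200, alert_below: float = 0.8):
        self.scores = deque(maxlen=window)
        self.alert_below = alert_below

    def record(self, quality_signal: float) -> None:
        # quality_signal in [0, 1]: retrieval score, feedback, etc.
        self.scores.append(quality_signal)

    def degraded(self) -> bool:
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough data to judge yet
        return sum(self.scores) / len(self.scores) < self.alert_below
```

Wire `degraded()` into whatever alerting you already have. The mechanism matters less than the direction of discovery: the dashboard pages you before the support queue does.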
Graceful degradation. When the model is unsure, say so. “I’m not confident in this answer — here are the source documents” costs you nothing and saves you a trust event. Teams that build graceful degradation into their AI systems see dramatically lower support costs. The wrong answer is expensive. The honest “I don’t know” is cheap.
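The fallback logic itself is a few lines. Here `confidence` and `sources` are whatever your pipeline already produces; the threshold value is an assumption you would tune against your own evals:

```python
# Graceful-degradation sketch: below a confidence threshold, return an
# honest fallback with sources instead of a confident guess.
CONFIDENCE_THRESHOLD = 0.7  # illustrative; tune against your evals

def respond(answer: str, confidence: float, sources: list[str]) -> str:
    if confidence >= CONFIDENCE_THRESHOLD:
        return answer
    links = "\n".join(f"- {s}" for s in sources)
    return f"I'm not confident in this answer. Here are the source documents:\n{links}"
```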
Scope control. The AI features with the worst cost ratios are the ones with the broadest scope. “Ask anything about our product” is a hallucination machine. “Get a summary of your last 5 invoices” is a tractable problem. Narrow scope means fewer failure modes, which means less debugging, less support, and less trust erosion.
The conversation to have
Next time someone asks “how do we reduce our AI costs,” don’t open the cloud console. Open the engineering time tracker. Open the support ticket system. Open the PM’s calendar.
Add up the hours your team spends on AI reliability work — debugging, explaining, apologizing, rebuilding trust. Multiply by your fully loaded cost per hour. Compare that number to your GPU bill.
The GPU bill is the easy cost. The hard costs are the ones you are already paying but have not yet named. Name them. Measure them. Then invest in the reliability work that actually reduces them.
The heuristic
If your monthly engineering time spent on AI debugging and maintenance exceeds your monthly inference cost, your reliability investment is too low. Fix the reliability. The other costs follow.
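The heuristic reduces to a one-line check. The inputs are monthly dollar figures you pull from your own time tracking and cloud bill:

```python
def reliability_underinvested(debug_and_maintenance_cost: float,
                              inference_cost: float) -> bool:
    """True when monthly AI debugging/maintenance spend exceeds
    monthly inference spend — the signal to invest in reliability."""
    return debug_and_maintenance_cost > inference_cost

# With the numbers from the accounting above: $24k debugging vs $8k inference.
print(reliability_underinvested(24_000, 8_000))  # True
```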
tl;dr
The pattern. Teams optimize the GPU bill while ignoring the engineering time, product management overhead, and trust erosion that together account for 83% of their actual AI costs. The fix. Build evals, output monitoring, and graceful degradation into your AI system so reliability failures get caught before they reach users. The outcome. When wrong answers stop reaching users, the debugging hours, support tickets, and stakeholder management meetings that consume your team quietly disappear from the sprint.