The demo is not the product
Getting an LLM to do the thing once in a notebook is the easy part. The hard part is getting it to do the thing reliably, at scale, for every user, on every edge case, at 3am. Most teams mistake the first for the second.
The notebook moment
Every AI project has a notebook moment. Someone opens a Jupyter notebook, pastes in some data, writes a prompt, hits shift-enter, and the output is shockingly good. The room gets excited. A Slack message goes out: “Look what I got working.” A demo gets scheduled for the end of the week.
The demo goes well. Leadership is impressed. A roadmap appears. Ship date: six weeks.
Here is the problem. That notebook moment — the one that created all the excitement — represents maybe 10% of the work. The other 90% is everything the notebook did not have to deal with.
The 90%
Error handling. The demo showed the happy path. In production, the API will return 429s. The model will occasionally produce unparseable output. The input data will contain characters that break your prompt template. The context window will overflow on long documents. Each of these needs a specific, tested recovery path.
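Two of those recovery paths — the 429 and the unparseable output — can be sketched as a retry loop with exponential backoff. This is a sketch, not a real client: `call_model` is a placeholder for whatever SDK function you actually use, and `RateLimitError` stands in for the 429 exception it raises.

```python
import json
import random
import time

class RateLimitError(Exception):
    """Stand-in for the 429 error your API client raises."""

def call_with_retries(call_model, prompt, max_retries=3, base_delay=1.0):
    """Retry a model call on rate limits and unparseable output.

    `call_model` is a placeholder for your actual client function; it is
    assumed to raise RateLimitError on a 429 and to return a string that
    should parse as JSON.
    """
    for attempt in range(max_retries + 1):
        try:
            raw = call_model(prompt)
            return json.loads(raw)  # raises JSONDecodeError on garbage output
        except (RateLimitError, json.JSONDecodeError):
            if attempt == max_retries:
                raise  # recovery path exhausted; surface the error
            # exponential backoff with jitter: ~1s, ~2s, ~4s, ...
            time.sleep(base_delay * 2 ** attempt + random.random() * base_delay)
```

The context-window overflow and the prompt-breaking input need their own, different handlers — the point is that each failure mode gets an explicit, tested path, not a generic try/except.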
Edge cases. The demo used 5 representative examples. Production will see thousands of variations, including the ones nobody anticipated. The contract written in French. The resume with no work experience section. The support ticket that is actually a love letter. Your system needs to handle all of them — or at least fail gracefully on the ones it cannot.
Latency. The demo ran synchronously and nobody cared that it took 8 seconds. In production, 8 seconds is an eternity. Now you need streaming, caching, prompt optimization, maybe a smaller model for simple cases and a larger one for hard cases. This is an architecture decision that touches every layer of the stack.
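The small-model/large-model split usually starts as a routing function in front of the call. This is a deliberately crude sketch — the model names are placeholders and the length heuristic is illustrative; real routers use a classifier or the small model's own confidence score:

```python
def route_model(prompt: str, simple_max_words: int = 200) -> str:
    """Pick a model tier with a cheap heuristic on the input.

    Model names are placeholders. Input length is a crude proxy for
    difficulty; in practice you would route on a classifier's output or
    on the small model's confidence.
    """
    if len(prompt.split()) <= simple_max_words:
        return "small-fast-model"
    return "large-capable-model"
```

The important property is that routing is a separate, testable decision — you can measure how often each tier is hit and what it costs, rather than hard-coding one model everywhere.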
Monitoring. In the demo, a human looked at the output and said “that’s good.” In production, nobody is looking. You need automated quality checks, drift detection, cost tracking per request, latency percentiles, error rates by input type. You need alerts. You need dashboards. You need someone who looks at the dashboards.
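What to record per request is more important than which backend you use. A minimal in-process sketch of the shape of the data — in production these numbers would go to your metrics system (Prometheus, Datadog, or similar), and the field names here are illustrative:

```python
import statistics
from collections import defaultdict

class RequestMetrics:
    """Minimal per-request metrics for LLM calls (illustrative sketch).

    In production, push these to a real metrics backend; this only shows
    what is worth recording: latency, cost, and errors by input type.
    """

    def __init__(self):
        self.latencies_ms = []
        self.errors_by_type = defaultdict(int)
        self.cost_usd = 0.0

    def record(self, latency_ms, cost_usd, error_type=None):
        self.latencies_ms.append(latency_ms)
        self.cost_usd += cost_usd
        if error_type is not None:
            self.errors_by_type[error_type] += 1

    def p95_latency_ms(self):
        # 19 cut points at 5% intervals; the last one is the 95th percentile
        return statistics.quantiles(self.latencies_ms, n=20)[-1]
```

The alerting thresholds ("page someone if p95 exceeds X") are the part a dashboard cannot give you — someone has to decide them and someone has to be paged.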
Eval suites. The demo was evaluated by vibes. Production needs a test suite — a set of inputs with expected outputs that you run on every change. Building this suite is unglamorous work. Maintaining it is worse. But without it, you have no idea whether your next prompt change made things better or worse.
Graceful degradation. What happens when the model is down? What happens when latency spikes to 30 seconds? What happens when your vector store returns no results? The demo did not address any of these because they did not happen during the demo. In production, they will happen on a Tuesday afternoon when half the team is on PTO.
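The common shape for all three failure modes is a hard timeout on the primary path plus a fallback. A sketch, with `primary` and `fallback` as placeholder callables — the fallback might be a smaller model, a cached answer, or a canned "try again later" response:

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as CallTimeout

def call_with_fallback(primary, fallback, prompt, timeout_s=10.0):
    """Run the primary call with a hard timeout; degrade to a fallback.

    `primary` and `fallback` are placeholders for your own call
    functions. Any failure of the primary path — timeout, exception,
    model outage — routes to the fallback instead of the user.
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(primary, prompt)
        try:
            return future.result(timeout=timeout_s)
        except CallTimeout:
            return fallback(prompt)
        except Exception:  # model down, connection reset, garbage response
            return fallback(prompt)
```

The design choice worth debating is what the fallback *is* for each feature — "no answer" is often better than "slow answer", and "cached answer" is often better than both.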
User feedback loops. The demo had no feedback mechanism. In production, you need to know when the system is wrong — and users will not tell you unless you make it trivially easy. Thumbs up/down, explicit corrections, implicit signals from behavior. This data is how you improve. Without it, you are flying blind.
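The key design decision is tying each signal to the exact request that produced it — prompt version, model, input — so feedback can flow back into the eval suite. A minimal record shape (field names are illustrative):

```python
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class FeedbackEvent:
    """One feedback signal tied to a specific request (illustrative).

    Linking feedback to a request_id lets you recover the full context
    later — the prompt version and input that produced the bad output —
    and turn the failure into a new eval case.
    """
    request_id: str
    signal: str                      # e.g. "thumbs_up", "thumbs_down", "correction"
    correction: Optional[str] = None # the user's fixed version, if given
    timestamp: float = field(default_factory=time.time)
```

Implicit signals (the user immediately rephrasing, or abandoning the session) get the same treatment: an event with a request_id, appended to the same log.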
Cost management. The demo made 5 API calls. Production will make 50,000 per day. At $0.01 per call, that is $500/day, $15k/month. Did the business case account for that? What about the calls that retry? What about the calls that hit the large model because the small model was not confident enough? Cost is an ongoing engineering problem, not a line item.
The mid-2024 demo wave
In mid-2024, the AI demo wave crested. Twitter was full of 30-second videos showing remarkable things: agents booking flights, copilots writing legal briefs, chatbots diagnosing medical conditions. Each demo was real. The model really did produce that output, in that context, on that input.
Most of them never shipped. Not because the technology did not work — it did, in the demo. They did not ship because the team that built the demo was not the team that could build the product. Or the team could build the product but the timeline assumed the demo was 80% of the work instead of 10%.
The ones that did ship — the ones that are still running — had something in common. They were built by teams that treated the notebook moment as the starting line, not the halfway point.
How to close the gap
Budget 10x the demo effort for production. If the demo took one engineer two weeks, the production system will take one engineer five months — or three engineers two months. This is not pessimism. These are the base rates from every AI project we have seen ship successfully.
Build the eval suite before you build the product. The eval suite defines what “working” means. Without it, you are shipping based on vibes and hoping for the best. Start with 50 test cases. Get to 200 before you launch. Grow it every time you find a failure.
Design for failure from day one. Every LLM call can fail, return garbage, or take too long. Your architecture should assume this. Fallback paths, timeouts, retry logic, human-in-the-loop escalation — these are not nice-to-haves. They are table stakes.
Staff it like a production system. AI features need on-call rotations, incident response, and operational runbooks just like any other production system. The model is a dependency. It will break. Someone needs to wake up.
Separate the research from the engineering. The person who built the demo in a notebook is probably great at prompt engineering and model selection. They may not be the right person to build the production deployment pipeline. These are different skills. Both are necessary.
The heuristic
When someone shows you a demo that works, ask one question: “What happens when this is wrong?” If the answer is “it won’t be,” you are looking at a demo. If the answer is a specific, boring description of error handling, fallbacks, and monitoring — you are looking at a product.
tl;dr
The pattern. Teams mistake the notebook moment — getting an LLM to do the thing once on a clean example — for most of the work, then schedule a six-week ship date for what is actually a five-month engineering project.

The fix. Budget 10x the demo effort for production, build the eval suite before you build the product, and staff the feature with on-call rotation and incident response from day one.

The outcome. Features that ship are actually reliable: they handle the French contract, the malformed resume, and the 3am API timeout — not just the five representative examples that looked great in the demo.