Your annual AI review should fit on one page
If you cannot summarize your AI program's impact in one page — what shipped, what it cost, what it changed — you do not understand your own program.
It is December. Someone on your leadership team asks for a retrospective on the AI program. Your team produces a 30-page deck. There are timelines, architecture diagrams, model comparison tables, a section on “learnings,” and a roadmap that extends to Q3 next year. It takes two weeks to write. Nobody reads past slide 8.
The length is not a sign of thoroughness. It is a sign that the team cannot identify what actually matters.
The one-page format
A useful annual AI review has four sections. Each section gets a few lines. If you cannot fill a section, that tells you something. If you need more than a few lines, you are hiding behind detail.
What shipped. List 3–5 things that are in production, serving real users, today. Not “we explored.” Not “we prototyped.” Not “we have a proof of concept that leadership was excited about.” What shipped. If the list has fewer than three items, your AI program has not yet earned its budget. That is useful information. Do not pad the list with work-in-progress.
What it cost. Total spend — compute, headcount, tooling, data labeling, everything. Break it down per shipped feature. This number is often uncomfortable. A feature that cost $400k to build and saves $50k per year is not a good investment yet; the payback arithmetic is sketched after these four sections. Write the number down anyway. Intellectual honesty about costs is what separates a program that will improve from one that will get cut.
What it changed. User metrics and business metrics. Did support ticket volume drop. Did user engagement increase. Did revenue change. Did time-to-resolution decrease. Use actual numbers, not percentages of percentages. If you cannot connect your AI features to a business metric, either the measurement is missing or the impact is.
What we would do differently. Two or three honest statements about what did not work. Not “we learned a lot about embeddings.” Specific operational lessons. “We spent 8 weeks on fine-tuning that delivered less accuracy improvement than a prompt change we made in 2 days.” “We shipped without an eval suite and spent a quarter recovering from a regression we could have caught.” This section is the most valuable part of the review. It is also the section teams most often skip.
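To make the cost section concrete, here is the back-of-the-envelope arithmetic behind the $400k example above, as a minimal Python sketch. The figures and variable names are illustrative, not part of the review format.

```python
# Hypothetical figures from the "What it cost" example above.
build_cost = 400_000      # compute + headcount + tooling + data labeling
annual_savings = 50_000   # measured business impact per year

payback_years = build_cost / annual_savings
print(f"Payback period: {payback_years:.0f} years")  # 8 years before break-even
```

Eight years to break even is exactly the kind of number that belongs on the page, not in an appendix.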
Why the length matters
A 30-page retrospective serves the team’s need to justify its existence. A one-page retrospective serves the organization’s need to make decisions.
Leadership does not need to know your embedding dimensions or your chunking strategy. They need to know whether the AI program is working. Working means: it shipped things, those things cost a known amount, and those things had a measurable impact. Everything else is supporting detail that belongs in a team wiki, not in a review.
The discipline of compression is the discipline of understanding. If you cannot fit your program’s impact on one page, one of two things is true. Either the impact is too diffuse to articulate — which means the program lacks focus — or the team does not know which parts of their work actually mattered. Both are problems worth discovering.
The sections nobody wants to fill in
Cost per feature is the number that generates the most discomfort. Teams resist calculating it because the answer is often unflattering. An AI chatbot that cost $600k to build and serves 200 queries per day is an expensive system. Writing down the per-query cost forces a conversation about whether this is the right investment. That conversation is necessary. Having it in December is better than having it in June when someone else initiates it.
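A rough version of that per-query arithmetic, using the figures from the chatbot example above. The one-year window is an assumption made for illustration, not a prescribed amortization schedule, and ongoing inference and maintenance costs are ignored.

```python
# Hypothetical figures from the chatbot example above.
build_cost = 600_000
queries_per_day = 200

queries_per_year = queries_per_day * 365        # 73,000 queries
cost_per_query = build_cost / queries_per_year  # build cost only, spread over year one
print(f"Year-one cost per query: ${cost_per_query:.2f}")  # roughly $8.22
```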
“What it changed” is the section that exposes measurement gaps. Many teams ship AI features without instrumenting them for business impact. They can tell you the model’s accuracy on their eval set. They cannot tell you whether users are better off. If this section is empty, the problem is not the review format — it is that the team has been building without feedback loops. Fix the instrumentation, not the review.
“What we would do differently” is the section that requires psychological safety. If the team writes “nothing, it all went great,” the review is useless. The real version includes statements that make someone uncomfortable. The feature that should have been killed earlier. The hire that was wrong for the role. The dependency on a vendor that turned out to be a bottleneck. These are the lessons that save you money next year.
How to use the review
The one-page review is not a filing exercise. It is a decision tool. After writing it, three decisions should be obvious.
Continue, expand, or cut. For each shipped feature, the cost and impact data tells you whether to keep investing. A feature with high impact and decreasing cost gets expanded. A feature with low impact and stable cost gets cut. A feature with high impact and high cost gets an optimization sprint. A sketch of this rule appears after these three decisions.
Where to focus next year. The “what we’d do differently” section points directly at the highest-leverage changes. If you spent too long on fine-tuning, invest in prompt engineering infrastructure. If you shipped without evals, make evals the first project next quarter.
Whether the program is earning its budget. This is the question leadership is actually asking. The one-page format makes the answer legible. Either the AI program shipped things that moved business metrics, or it didn’t. If it didn’t, the review should say so — and explain what needs to change.
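Below is a sketch of the continue/expand/cut rule, to show how mechanical the decision becomes once cost and impact are written down. The field names, categories, and the fallback case are hypothetical, not a prescribed schema.

```python
from dataclasses import dataclass

@dataclass
class Feature:
    name: str
    high_impact: bool  # moved a business metric meaningfully
    cost_trend: str    # "decreasing", "stable", or "increasing"

def decision(feature: Feature) -> str:
    # The three cases named above; anything else needs a conversation.
    if feature.high_impact and feature.cost_trend == "decreasing":
        return "expand"
    if not feature.high_impact and feature.cost_trend == "stable":
        return "cut"
    if feature.high_impact:
        return "optimization sprint"
    return "needs a conversation"

print(decision(Feature("support triage", high_impact=True, cost_trend="decreasing")))  # expand
```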
The heuristic
At the end of the year, write your AI program review on one page. Four sections: what shipped, what it cost, what it changed, what you’d do differently. If you cannot fill the page, your program needs focus. If you need more than a page, you need to decide what actually mattered. Either way, the constraint is the point.
tl;dr
The pattern. AI teams produce 30-page retrospectives to justify their existence, hiding the uncomfortable numbers — cost per feature, unmeasured business impact, mistakes worth learning from — behind architecture diagrams and roadmaps nobody reads past slide 8. The fix. Force the review into four sections on one page: what shipped, what it cost, what it changed, and what you would do differently — with actual numbers in each. The outcome. Leadership can make a real decision about the program’s budget, and the team finally has the specific operational lessons that will save them money next year.