№ 40 ops Nov 21, 2025 · 9 min read

The AI audit your board will eventually ask for

Sooner or later, someone — a board member, a regulator, a customer — will ask you to prove your AI systems are working correctly. Here is how to be ready before they ask.


Someone is going to ask. It might be a board member who read an article about AI risk. It might be a regulator with a new framework. It might be a customer whose contract requires an AI addendum. It might be your insurance carrier.

The question will be some version of: “How do you know your AI systems are doing what you think they’re doing?”

And you will either have an answer or you will not. The difference between those two states is about 40 hours of work — if you do it proactively. If you do it under pressure, it is 400 hours and a significant distraction from everything else your team is supposed to be shipping.

What an AI audit actually looks like

Strip away the compliance language and an AI audit is four questions.

What AI are you running? An inventory of every AI system in production — not just the chatbot your marketing team launched, but the recommendation model in your product, the classification system in your support pipeline, the summarization tool your ops team built in a weekend, and the 14 GPT wrappers various teams are using via personal API keys.

Most companies do not have this inventory. They have a partial list that covers the systems built by the ML team. They do not have the systems built by product teams, the systems bought from vendors, or the systems adopted by individual employees. The first step in being audit-ready is knowing what you are running. You cannot govern what you cannot see.

How do you know it is working? Documented evaluation criteria for each system. What does “working” mean for this specific system? What metrics do you track? How often do you measure them? What are the thresholds for acceptable performance?

For a customer-facing chatbot, “working” might mean: answer accuracy above 90% on a curated test set, hallucination rate below 2%, response latency under 3 seconds, and no responses that violate your content policy. For a document classification system, “working” might mean: precision above 95% on your top 10 categories, with a human review step for anything classified with low confidence.

The key is that “working” is defined, measured, and documented — not assumed. “Our users seem happy” is not an audit answer. “Here are last quarter’s eval results showing 92% accuracy on our 200-question test set” is.
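The "defined, measured, documented" standard can be made literal as a threshold check that runs after each eval. A minimal sketch — the metric names and limits below mirror the chatbot example above and are illustrative, not a standard:

```python
# Each metric has a direction ("min" = must stay above, "max" = must stay below)
# and a threshold. These names and values are hypothetical examples.
THRESHOLDS = {
    "answer_accuracy":    ("min", 0.90),  # fraction correct on curated test set
    "hallucination_rate": ("max", 0.02),
    "p95_latency_s":      ("max", 3.0),
}

def check_evals(results: dict) -> list:
    """Return a list of threshold violations; an empty list means 'working'."""
    failures = []
    for metric, (kind, limit) in THRESHOLDS.items():
        value = results.get(metric)
        if value is None:
            failures.append(f"{metric}: not measured")
        elif kind == "min" and value < limit:
            failures.append(f"{metric}: {value} is below {limit}")
        elif kind == "max" and value > limit:
            failures.append(f"{metric}: {value} is above {limit}")
    return failures

# Last quarter's results, checked against the documented thresholds.
print(check_evals({"answer_accuracy": 0.92,
                   "hallucination_rate": 0.01,
                   "p95_latency_s": 2.4}))  # → []
```

The output of this check — pass or a named list of failures — is exactly the artifact an auditor wants to see: evidence that "working" was defined before the question was asked.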

What happens when it is wrong? Every AI system produces wrong output. The question is not whether it will be wrong — it is what happens when it is. Do you have incident detection? Do you have a response process? Do you have a way for users to flag bad output? Do you have a log of past incidents and how they were resolved?

This is where most companies have the biggest gap. They built the AI system. They might even eval it regularly. But they have no incident process. When the model produces a bad output, someone notices, someone fixes the prompt, someone deploys — and none of it is documented. There is no trail. There is no way to look back and say “here are the 7 incidents we had last quarter, here is what caused them, here is what we changed.”

Where does the data come from? Data lineage. What data does each AI system use? Where does it come from? How is it processed? Who has access? How is it stored? Is any of it PII? Is any of it subject to data residency requirements?

This is the question regulators care about most and engineers care about least. The model is a function of its data. If you cannot trace the data, you cannot explain the output. And if you cannot explain the output, you have a governance problem that no amount of model evaluation will solve.

Why you should build this before you are asked

The cost of building an AI governance framework proactively is small. A spreadsheet, some documentation, a quarterly review cadence. Maybe 40 hours of work spread across a few people.

The cost of building it reactively — when the board asks, when the regulator sends a letter, when the customer requires it for contract renewal — is an order of magnitude higher. Not because the work is different, but because the context is different.

Under pressure, you are doing archaeology. You are reverse-engineering which systems use which data. You are asking engineers to reconstruct eval results from 6 months ago. You are discovering AI systems that nobody on the leadership team knew existed. You are doing all of this while also trying to maintain the appearance that you have it under control.

Under pressure, you also make bad governance decisions. You over-restrict. You implement heavy-handed approval processes that slow down development. You create compliance theater — checkboxes and review boards that produce documentation without producing understanding. The reactive governance framework is almost always worse than the proactive one, and it costs 10x more to build.

Build it now. It is easier, cheaper, and produces a better result.

The minimum viable governance framework

You do not need a Chief AI Ethics Officer. You do not need a 50-page policy document. You do not need a governance platform. You need three things.

A spreadsheet

One row per AI system. Columns:

  • System name
  • Owner (a person, not a team)
  • What it does (one sentence)
  • What data it uses
  • How it is evaluated (link to eval results)
  • Last eval date
  • Current performance (key metric and value)
  • Incident count (last quarter)
  • Risk level (high/medium/low — based on customer impact if the system produces wrong output)

This spreadsheet is the inventory. It is the thing you hand to the board member, the regulator, the auditor. It takes an afternoon to create and 30 minutes per quarter to update. It is the single most valuable governance artifact you can produce.

A quarterly review

Once per quarter, the owner of each AI system presents a 5-minute update: eval results, incidents, changes, and any concerns. The audience is a small group — your CTO, your head of product, maybe a legal representative.

The purpose is not approval. It is awareness. The review ensures that leadership knows what AI systems exist, how they are performing, and where the risks are. It creates a forcing function for the system owners to actually run their evals and document their incidents.

Keep it tight. 5 minutes per system. No slide decks. Just the spreadsheet row, updated, with a verbal summary. If you have 10 AI systems, the review takes less than an hour.

An incident log

Every time an AI system produces output that is wrong in a way that matters — not every typo, but every incident where the wrong output could have or did cause harm, confusion, or cost — log it.

The log is simple: date, system, what happened, what caused it, what was changed, who was involved. This is not a post-mortem for every incident. It is a line in a spreadsheet.
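Because the log is just a line per incident, it can be a small append helper that any on-call engineer can run. A sketch, assuming the same CSV-in-a-repo approach as the inventory; the example entry is hypothetical:

```python
import csv
from pathlib import Path

LOG = Path("ai_incident_log.csv")
FIELDS = ["date", "system", "what_happened", "cause", "change_made", "people"]

def log_incident(entry: dict) -> None:
    """Append one row to the incident log, writing the header on first use."""
    new_file = not LOG.exists()
    with LOG.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow(entry)

# Hypothetical incident, logged in under a minute.
log_incident({
    "date": "2025-11-21",
    "system": "support-classifier",
    "what_happened": "Mislabeled refund requests as spam",
    "cause": "Prompt change regressed on an edge case",
    "change_made": "Added refund examples to the few-shot prompt",
    "people": "J. Doe",
})
```

The point of the helper is friction: logging an incident should cost less effort than deciding not to.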

Over time, this log becomes your most valuable governance tool. It tells you which systems are fragile. It tells you what kinds of failures you are prone to. It tells you whether your fixes are working. And when someone asks “what happens when your AI is wrong,” you can show them the log and say: “Here is what happened. Here is what we did about it.”

The three questions auditors actually ask

We have sat in these meetings. Board reviews, customer audits, regulatory conversations. The questions are remarkably consistent.

“What AI are you running?” They want the inventory. They want to know the scope. They are trying to understand whether you know what you have. If you pull out the spreadsheet, this question takes 2 minutes. If you do not have the spreadsheet, this question takes 2 weeks.

“How do you know it’s working?” They want eval results. They do not need to understand the metrics — they need to see that you have metrics, that you measure them regularly, and that the results are within the thresholds you defined. The existence of a rigorous evaluation process is more reassuring than any specific number.

“What happens when it’s wrong?” They want the incident log. They want to see that you have a process — that when things go wrong, you detect it, respond to it, and learn from it. Companies that have an incident log with 12 entries and a clear pattern of improvement look better than companies that claim they have never had an incident. Zero incidents means you are not looking, not that nothing went wrong.

That is it. Three questions. If you can answer all three clearly and with documentation to support your answers, you pass. Not because you are perfect — nobody is — but because you are paying attention. And paying attention is what governance actually means.

The timeline

Start now. Not because an audit is imminent, but because the work is small and the payoff compounds.

Week 1: Build the inventory spreadsheet. Go talk to every engineering team. Find every AI system. Fill in the rows.

Week 2: For each system, confirm there is an eval process. If there is not — and for some there will not be — flag it. That is your priority list.

Week 3: Create the incident log. Retroactively fill it in from Slack threads and post-mortems if you can. Going forward, make it part of your incident response process.

Week 4: Schedule the first quarterly review. Put it on the calendar. Make it recurring.

Four weeks. Mostly part-time. And you will be ready for the question before anyone asks it.

tl;dr

The pattern. Companies build AI systems without governance, then scramble to create audit documentation under pressure — producing compliance theater that costs 10x more and protects the business less.

The fix. Build a minimum viable governance framework now — an inventory spreadsheet, a quarterly review, and an incident log — before the board, a regulator, or a customer asks for it.

The outcome. You answer the three audit questions (what AI are you running, how do you know it works, what happens when it is wrong) in minutes instead of weeks, and your governance actually improves your AI systems instead of just documenting them.

