№ 45 rag Feb 14, 2026 · 9 min read

Stop benchmarking on Wikipedia

Your retrieval benchmark is lying to you if it’s on a corpus your model has seen. Here is a small, cheap protocol for building an eval set on your actual corpus, plus a script that will tell you when your retriever has quietly regressed.

The problem

You built a RAG system. You benchmarked it on a public dataset — maybe Natural Questions, maybe HotpotQA, maybe something you found in a blog post. Your numbers look good. You ship it. Three weeks later, users are complaining that the answers are wrong.

The benchmark lied. Not because the benchmark is bad. Because the benchmark corpus is not your corpus.

Why public benchmarks fail you

Public benchmarks test retrieval on corpora that large language models have already seen during training. This means the model can sometimes answer the question correctly without retrieving anything. Your retrieval could be returning garbage and the benchmark would still show high accuracy.

Your corpus is different. Your internal docs, your Confluence pages, your Notion databases — the model has never seen these. When retrieval fails on your corpus, the model cannot compensate. The answer is wrong, and the user notices.

The protocol

Here is how to build an eval set that actually measures your retrieval quality:

Step 1. Pull 50 documents from your actual corpus. Pick them at random. Do not cherry-pick.

Step 2. For each document, write 2–3 questions that can only be answered by reading that specific document. Not trivia. Real questions that a user would actually ask.

Step 3. For each question, record the document ID that contains the answer. This is your ground truth.

Step 4. Run your retriever on each question. Check whether the correct document appears in the top-k results. The fraction of questions where it does is your recall@k.

That is your eval set. 100–150 question-document pairs. It takes about a day to build. It is worth more than any public benchmark you will ever run.
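The four steps above boil down to a loop over question-document pairs. Here is a minimal sketch, assuming a `retrieve(question, k)` function that returns a ranked list of document IDs — that name and signature are placeholders; swap in whatever your retriever actually exposes.

```python
def recall_at_k(eval_set, retrieve, k=10):
    """Fraction of questions whose gold document appears in the top-k results.

    eval_set: list of (question, gold_doc_id) pairs built in steps 1-3.
    retrieve: callable (question, k) -> ranked list of document IDs.
    """
    hits = 0
    for question, gold_doc_id in eval_set:
        results = retrieve(question, k)
        if gold_doc_id in results[:k]:
            hits += 1
    return hits / len(eval_set)
```

Run it once per retriever configuration and you get a single number you can track over time — which is exactly what the regression script below needs.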

The regression script

Once you have the eval set, run it on every deploy. If recall@10 drops by more than 5 points, block the deploy. This catches silent regressions — the kind where someone changes an embedding model or a chunking strategy and does not realize they just broke retrieval for 30% of queries.
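A deploy gate for this can be a few lines of CI glue. The sketch below assumes recall is expressed in points (0–100) and that the last passing run's score lives in a baseline file; the file name and the 5-point threshold are illustrative, matching the rule above.

```python
import json
import sys
from pathlib import Path

BASELINE_FILE = Path("recall_baseline.json")  # written by the last passing run
MAX_DROP = 5.0  # points of recall@10 allowed before blocking the deploy

def gate(current_recall_at_10: float) -> int:
    """Return 0 if the deploy may proceed, 1 if it should be blocked."""
    if BASELINE_FILE.exists():
        baseline = json.loads(BASELINE_FILE.read_text())["recall_at_10"]
        drop = baseline - current_recall_at_10
        if drop > MAX_DROP:
            print(f"BLOCK: recall@10 fell {drop:.1f} points "
                  f"({baseline:.1f} -> {current_recall_at_10:.1f})")
            return 1
    # A passing run becomes the new baseline.
    BASELINE_FILE.write_text(json.dumps({"recall_at_10": current_recall_at_10}))
    return 0

if __name__ == "__main__":
    sys.exit(gate(float(sys.argv[1])))
```

One design choice to be aware of: updating the baseline on every passing run means recall can drift down 4.9 points at a time across many deploys. If that worries you, pin the baseline and only update it deliberately.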

tl;dr

The pattern. Teams benchmark their RAG retrieval on public datasets like Natural Questions or HotpotQA and get good numbers, then ship to production where the model is answering from internal documents it has never seen — and the retrieval quietly fails.

The fix. Spend a day building 100–150 question-document pairs from your actual corpus, then run recall@k against it on every deploy and block if it drops more than 5 points.

The outcome. You have a retrieval benchmark that measures your real system, and silent regressions from embedding model swaps or chunking changes get caught before they reach users.

