
Your RAG Pipeline Has a Check Engine Light. You're Ignoring It.

[Image: dashboard showing a GO/NO-GO decision framework with seven evaluation criteria for RAG pipeline quality assessment]
~10 min read
Ryan Goodrich
Technical Writer, AI Enthusiast, and Developer Advocate

I ran a retrieval experiment that returned perfect zeros across all 36 queries, and the statistical test I'd built said "statistically significant." The decision engine considered seven criteria, found that only two passed, and issued a NO-GO. The pipeline caught the problem. Not me--the pipeline.

Here's what scares me: most production RAG systems don't have a pipeline like that. They don't have decision criteria. They don't have rollback thresholds. They don't have a concept of "this retrieval result is wrong and we should know about it automatically." They ship a model, run some spot checks, and move on to the next sprint.

Your RAG pipeline has a check engine light. You just never installed it.

The Dashboard You Don't Have

What I built for FractalRecall's D-23 experiment isn't perfect. (It caught a real problem but also passed metrics that were wrong.) It exists, though, which puts it ahead of most production systems I've seen.

The framework evaluates seven criteria before issuing a GO or NO-GO:

  1. Aggregate Improvement. Did the overall metrics improve compared to baseline? Not "did they change"--did they move in the right direction by a meaningful amount?
  2. Statistical Significance. Was the improvement unlikely to be random? (This is where things get interesting, because "statistically significant" and "correct" are not the same thing. More on that in a moment.)
  3. Query Degradation Rate. What percentage of individual queries got worse? An aggregate improvement can hide widespread degradation if a few queries improved dramatically.
  4. Per-Type Analysis. Did performance improve across different query types (factual, relational, comparative, authority), or only in some categories?
  5. Effect Size. Is the improvement practically meaningful, not just mathematically detectable?
  6. Overflow Rate. Did the enrichment process silently discard data? How much? Is that acceptable?
  7. Stability. Are the results consistent across runs, or do they fluctuate?

Seven criteria. A majority must pass for a GO decision, with hard vetoes on degradation rate and overflow. The framework runs automatically after every experiment.
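The decision logic above can be sketched in a few lines. This is a minimal illustration, not FractalRecall's actual code; the criterion names and the majority rule are assumptions based on the description in this post.

```python
# Sketch of the GO/NO-GO vote: a majority of criteria must pass,
# and the two hard vetoes (degradation rate, overflow) override everything.
def decide(criteria: dict[str, bool]) -> str:
    """criteria maps criterion name -> passed? Returns "GO" or "NO-GO"."""
    # Hard vetoes: failing either one forces NO-GO regardless of the vote.
    if not criteria["degradation_rate_ok"] or not criteria["overflow_ok"]:
        return "NO-GO"
    passed = sum(criteria.values())
    return "GO" if passed > len(criteria) / 2 else "NO-GO"

# A D-23-shaped run: two criteria pass, the degradation veto fires.
result = decide({
    "aggregate_improved": True,
    "statistically_significant": True,
    "degradation_rate_ok": False,   # 100% of queries degraded
    "per_type_ok": False,
    "effect_size_ok": False,
    "overflow_ok": True,
    "stable": False,
})
```

The veto-before-vote ordering is the point: a spectacular p-value cannot outvote a broken pipeline.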

This took about a day to build. It has caught two genuine problems (the D-22 overflow and the D-23 metric bug). It has saved me from publishing results I would have had to retract. And it is, as far as I can tell, more evaluation infrastructure than most companies apply to their production RAG systems.

Why Nobody Builds the Dashboard

I have theories about this, and none of them are flattering.

Retrieval quality is invisible. If your chatbot returns a wrong answer, the user might not notice. They didn't know the right answer--that's why they asked. The feedback loop that exists for other software bugs (crash reports, error logs, angry support tickets) barely exists for retrieval errors. The system fails silently, and silence is comfortable. (This also explains the "it works on my test queries" approach: developers run five queries they already know the answers to and call it validated. Confirmation bias with a search bar.)

Evaluation is boring. Building a RAG pipeline is interesting. Choosing an embedding model! Tuning chunk sizes! Experimenting with re-ranking! Evaluation is where you write tests, compute metrics, and stare at tables of numbers. The vegetables of machine learning and the bane of my existence.

The metrics are hard. They have jargon names--precision, recall, NDCG, MRR--but the questions they ask are simple. Did you return the right stuff? Did you find all of it? Is the best result near the top? How far does the user scroll before hitting something useful? The problem is that "retrieval quality" is not one thing. A system can return only relevant results but miss half the relevant documents. It can ace factual queries and completely miss relational ones. Understanding the metrics means understanding those tradeoffs, and the tradeoffs require caring enough to learn.
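For concreteness, here are toy implementations of the four metrics named above, for a single query with binary relevance. These are the textbook definitions, not any particular library's versions; `retrieved` is the ranked list of document IDs the system returned, `relevant` the ground-truth set.

```python
import math

def precision(retrieved, relevant):
    """Did you return the right stuff? Fraction of results that are relevant."""
    return sum(d in relevant for d in retrieved) / len(retrieved)

def recall(retrieved, relevant):
    """Did you find all of it? Fraction of relevant docs that were returned."""
    return sum(d in relevant for d in retrieved) / len(relevant)

def mrr(retrieved, relevant):
    """Is the best result near the top? Reciprocal rank of the first hit."""
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

def ndcg(retrieved, relevant):
    """How far does the user scroll? Rank-discounted gain vs. the ideal ranking."""
    dcg = sum(1.0 / math.log2(r + 1)
              for r, d in enumerate(retrieved, 1) if d in relevant)
    ideal = sum(1.0 / math.log2(r + 1)
                for r in range(1, min(len(relevant), len(retrieved)) + 1))
    return dcg / ideal if ideal else 0.0
```

The tradeoff is visible right in the denominators: precision divides by what you returned, recall by what you should have returned. A system can max out one while tanking the other.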

I suspect the boring one is strongest. Evaluation doesn't feel like progress. It feels like homework.

What Statistically Significant Actually Means (A Cautionary Tale)

D-23 is the single best argument I have for why evaluation infrastructure matters.

I ran 36 queries against a corpus enriched with layers of structural context: domain, entity type, authority status, temporal era, relationships, section heading, and sequence position. Every query returned zero relevant documents. Every metric (precision, recall, NDCG, MRR) came back as 0.000.

The statistical test I ran (comparing D-23 to the D-22 baseline) returned a p-value (roughly, the probability of seeing a difference this extreme by chance alone) of approximately 0.0000000000000000007. That is not a typo. It was reporting near-absolute certainty that the results were different from baseline.

And it was correct! The results were different from baseline. They were zero. All of them. Because I had a bug in my metric computation where chunk IDs included a #chunk_002 suffix that didn't match the expected document filenames.
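This class of bug is cheap to guard against: normalize IDs before comparing, and fail loudly when an entire run matches nothing. A sketch, assuming a `doc_id#chunk_nnn` ID scheme like the one the bug involved; the exact format in FractalRecall may differ.

```python
def normalize(chunk_id: str) -> str:
    """Strip the chunk suffix so retrieved IDs compare against document names."""
    return chunk_id.split("#", 1)[0]

def matched_docs(retrieved_ids, relevant_docs):
    """Intersect normalized retrieved IDs with the ground-truth document set."""
    return {normalize(c) for c in retrieved_ids} & set(relevant_docs)

# The D-23 failure mode: without normalize(), "doc_017#chunk_002" never
# equals "doc_017", and every query silently scores zero.
retrieved = ["doc_017#chunk_002", "doc_042#chunk_000"]
assert matched_docs(retrieved, {"doc_017"}) == {"doc_017"}
```

A second cheap guard: assert somewhere that at least one query in the whole run produced at least one match. Zero matches across 36 queries is almost never a real result.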

The statistical test did exactly what it was designed to do: determine whether two distributions were likely to be different. It is not designed to determine whether your code is correct. It is not designed to determine whether the result makes sense. It is designed to answer one narrow question about two sets of numbers, and it answered that question with extreme confidence.

If my evaluation framework had consisted only of "run the statistical test and check if p < 0.05 (the standard threshold for 'probably not random')," I would have concluded that multi-layer enrichment catastrophically degrades retrieval quality. That conclusion would have been supported by a p-value that most journals would kill for. And it would have been spectacularly wrong.

The GO/NO-GO framework caught it--not through the significance test, but through the degradation rate criterion. When 100% of queries degrade, something is broken, regardless of what the p-value says. The dashboard has multiple indicators because no single indicator is sufficient.

What Your Dashboard Should Actually Check

None of this is theoretical. These are checks that have caught real bugs in my own work.

1. Baseline Comparison (Non-Negotiable)

Every change to your retrieval pipeline (new embedding model, different chunk size, updated re-ranking) should be compared against a frozen baseline using the same queries and the same ground truth. Not "we ran some queries and it seemed better." A structured comparison with numbers.

If you don't have ground truth (a set of queries with known-correct answers), build it. Manually, if you have to. Twenty well-curated queries with verified answers are worth more than a thousand unverified spot checks.
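A hand-curated ground truth is nothing more than a mapping from query to the documents a human verified as relevant, plus a scorer and a frozen snapshot of a known-good run. A sketch; the two queries below are from the experiments in this post, but the document IDs, file name, and function names are invented placeholders.

```python
import json

# Query -> set of human-verified relevant document IDs (IDs are illustrative).
GROUND_TRUTH = {
    "Which documents are canonical?": {"doc_003", "doc_011"},
    "What factions are connected to the Dvergr?": {"doc_007", "doc_019"},
}

def score_run(retrieve, ground_truth=GROUND_TRUTH):
    """Per-query recall for a retrieval function `retrieve(query) -> list of IDs`."""
    return {
        q: len(set(retrieve(q)) & relevant) / len(relevant)
        for q, relevant in ground_truth.items()
    }

def freeze_baseline(scores, path="baseline.json"):
    """Save a known-good run's per-query scores. Never regenerate it;
    every future change compares against this frozen file."""
    with open(path, "w") as f:
        json.dump(scores, f, indent=2, sort_keys=True)
```

The freeze is the part people skip: if the baseline is recomputed on every run, it drifts along with your bugs, and the comparison measures nothing.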

2. Per-Query Degradation Rate

Aggregate metrics lie. They average out the disasters.

If your new pipeline improves average recall (the fraction of relevant content it actually finds) from 0.72 to 0.78 but degrades 40% of individual queries, you have not improved your system. You have improved your system for some users while making it worse for others. The aggregate number is a press release. The degradation rate is the truth.

My threshold: if more than 25% of queries degrade, the change does not ship. Period. Even if the aggregate improves. Especially if the aggregate improves--because that means a small number of dramatic improvements are masking widespread damage.
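The computation is trivial, which is exactly why there's no excuse not to run it. A sketch using the 25% threshold stated above; the per-query scores are invented to show the masking effect.

```python
def degradation_rate(baseline: dict[str, float], current: dict[str, float]) -> float:
    """Fraction of individual queries whose metric got worse vs. baseline."""
    degraded = sum(current[q] < baseline[q] for q in baseline)
    return degraded / len(baseline)

# Aggregate improves (mean 0.65 -> 0.7775) while half the queries degrade:
# the press release vs. the truth.
baseline = {"q1": 0.70, "q2": 0.40, "q3": 0.90, "q4": 0.60}
current  = {"q1": 0.95, "q2": 0.35, "q3": 0.85, "q4": 0.96}
rate = degradation_rate(baseline, current)
ships = rate <= 0.25
```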

3. Overflow / Data Loss Tracking

If your pipeline involves any transformation that can discard data (token limits, chunk size constraints, format conversion, deduplication), track the loss rate. I learned this the hard way: silent data loss can hide behind improving metrics.

Assert on your chunk counts. Before indexing: N chunks. After indexing: N chunks. If those numbers differ, stop and find out why.
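In code, this is a conservation check wrapped around any stage that could drop data. A minimal sketch; the wrapper and its names are illustrative, not a specific library's API.

```python
def index_with_audit(chunks, index_fn):
    """Run an indexing step and fail loudly if it silently discarded anything."""
    expected = len(chunks)
    indexed = index_fn(chunks)
    actual = len(indexed)
    assert actual == expected, (
        f"Indexing discarded {expected - actual} of {expected} chunks -- "
        "stop and find out why before trusting any metric."
    )
    return indexed
```

The same wrapper works for token-limit truncation, format conversion, or deduplication: count in, count out, and treat any difference as an error until proven intentional.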

4. Type-Level Breakdowns

Don't just compute overall precision. Compute precision by query type, by document type, by topic. A system that's amazing at factual queries and terrible at relational queries is not "pretty good overall"--it's broken for relationship queries and nobody noticed because the factual queries pulled the average up.

In my experiments, authority queries ("Which documents are canonical?") behaved differently from relational queries ("What factions are connected to the Dvergr?"). Aggregate metrics hid this. Type-level breakdowns revealed it.
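A type-level breakdown is just the same per-query scores grouped by a type tag before averaging. A sketch with invented scores chosen to show how the aggregate hides a broken category:

```python
from collections import defaultdict

def per_type_means(scores: dict[str, float], types: dict[str, str]) -> dict[str, float]:
    """Mean score per query type, instead of one aggregate number."""
    buckets = defaultdict(list)
    for query, score in scores.items():
        buckets[types[query]].append(score)
    return {t: sum(v) / len(v) for t, v in buckets.items()}

scores = {"q1": 0.9, "q2": 0.8, "q3": 0.1, "q4": 0.2}
types  = {"q1": "factual", "q2": "factual",
          "q3": "relational", "q4": "relational"}
by_type = per_type_means(scores, types)
# Aggregate mean is 0.5 ("pretty good"); the relational bucket is broken.
```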

5. The Veto

Every framework needs a hard stop--a condition where the change does not ship regardless of any other metric. For me, it's 100% query degradation (which catches catastrophic bugs like D-23's) and overflow above 50% (which catches silent data loss like D-22's).

Your vetoes will be different. But you need them. Without a veto, there is always a way to rationalize shipping a broken change. "The aggregate improved." "The p-value is significant." "We're behind on the roadmap." A veto is a firewall against motivated reasoning.

The Boring Part Is the Important Part

I know this is not a sexy blog post. It doesn't have a dramatic twist or a surprising result. It says: build a dashboard, check your metrics, don't trust any single number.

But here's what I keep coming back to: I have a research project with three experiments, a corpus of 77 documents, and 36 queries. Small-scale, personal, low-stakes. And even at that scale, evaluation infrastructure caught two bugs that would have produced wrong conclusions.

If my little retrieval experiment needs a seven-criterion decision framework to avoid publishing nonsense, what does your production RAG system need? The one serving real users, handling real queries, influencing real decisions.

Whatever it is, you probably don't have it yet. And the reason you don't is not that you can't build it. It's that you haven't decided it's worth building. The vibes feel good. The spot checks pass. The users aren't complaining.

The check engine light is off because it was never wired in.

Wire it in.