
I Added Context to My Embeddings and 43% of My Data Disappeared

Terminal showing embedding pipeline results: 218 chunks input, 124 surviving, 94 silently dropped, with retrieval metrics improving despite data loss
~7 min read
Ryan Goodrich
Technical Writer, AI Enthusiast, and Developer Advocate

In Part 1, I mentioned the D-22 experiment almost as an aside. Twenty-four tokens of metadata prefix, 16.5% improvement in ranking quality, 27.3% recall jump. Good numbers. Clean story.

I left out the part where 43% of my data vanished.

Not "performed poorly." Not "returned lower-quality results." Vanished. Ninety-four of 218 chunks silently dropped from the index because I added one sentence of context and didn't do the arithmetic on what that sentence would cost. The embedding pipeline didn't warn me. ChromaDB didn't complain. I only noticed because I'm the kind of person who checks row counts after every insert. (This is not a personality trait. It's scar tissue.)

The results improved anyway. That's the part I need to explain.

Silent Overflow

Embedding models have a token budget. nomic-embed-text-v1.5 accepts up to 8,192 tokens per input. My chunks were sized for a 1,024-token target window. Most fit comfortably.

Then I added the prefix.

"Domain: faction. Entity: Iron-Banes Alliance. Canon: true."

Twenty-four tokens on average. The chunks that were already near their limit tipped over. The embedding pipeline didn't raise an exception or log a warning. It just skipped them. Ninety-four chunks, gone. The index built successfully, the queries ran, the metrics came back looking healthy. If I hadn't compared the chunk count in ChromaDB against the source count in my data directory, I'd have published the D-22 results without ever knowing half the corpus was missing.
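A pre-flight token check would have caught this before indexing. Here's a minimal sketch; the function names, the callable tokenizer, and the configurable limit are all illustrative, not the pipeline's actual API, and a real check should use the embedding model's own tokenizer rather than the whitespace stand-in:

```python
def find_overflows(chunks, prefix, count_tokens, limit):
    """Return indices of chunks that would exceed the per-input token
    limit once the metadata prefix is prepended."""
    overflowed = []
    for i, chunk in enumerate(chunks):
        if count_tokens(prefix) + count_tokens(chunk) > limit:
            overflowed.append(i)
    return overflowed

# Crude whitespace tokenizer for illustration only; swap in the real
# model tokenizer before trusting the counts.
word_count = lambda text: len(text.split())
```

Run this before the embed step and you know exactly which chunks are at risk, instead of finding out from a row count afterward.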

This is the failure mode nobody warns you about in RAG tutorials. Not "bad results." Invisible data loss that makes your results look better than they should.

The Confound

Here's the D-22 data, which I presented in Part 1 without enough caveats:

Metric        D-21 (Baseline)   D-22 (Prefix)   Change
Recall@10     0.720             0.917           +27.3%
NDCG@10       0.706             0.823           +16.5%
Precision@5   0.383             0.417           +8.8%
MRR           0.845             0.861           +1.9%

Every metric improved. With 43% of the corpus missing.

I sat with this for a while. Then I sat with it some more. Then I wrote a findings document, because that's apparently how I process confusion.

The problem: I can't tell you how much of this improvement came from the metadata enrichment and how much came from accidentally deleting the worst chunks. Two things happened simultaneously, and I didn't instrument the experiment to separate them.

Three Explanations

The dropped chunks were noise. Maybe the chunks that overflowed were the weakest ones: already too long, too dense, too full of prose that added bulk without adding signal. Dropping them was accidental data cleaning. The corpus got leaner, and leaner corpora retrieve better.

Plausible. Also uncomfortable. If losing 43% of my data improves results, that's less "my enrichment strategy is brilliant" and more "my chunking strategy was terrible."

The metadata was doing the heavy lifting. Without the prefix, the embedding model knew a chunk was about armor, warfare, and territorial disputes. With the prefix, it knew this was a canonical faction document about a specific organization. The semantic fingerprint sharpened. Queries about factions found faction documents. Queries about authority found canonical documents. The prefix acted as a disambiguator, not just a label.

This is what I believe. This is what matters for FractalRecall's thesis.

Both. The enrichment improved the surviving chunks and the overflow removed the weakest chunks. Two effects, additive. The improvement was real but inflated by a confounding variable I can't measure because I didn't log which chunks overflowed.

This is probably the honest answer.

What 24 Tokens Replace

The prefix is three facts in one sentence:

Domain: faction. Entity: Iron-Banes Alliance. Canon: true.

A domain classification. An entity name. An authority flag. Roughly the same token cost as "once upon a time, in a land far, far away," except this version actually helps retrieval.
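The prefix is cheap to generate mechanically from chunk metadata. A sketch, assuming the three facts live as fields on each chunk (`make_prefix` is a hypothetical helper, not FractalRecall's actual code):

```python
def make_prefix(domain, entity, canon):
    """Render the three metadata facts as a one-sentence prefix."""
    return f"Domain: {domain}. Entity: {entity}. Canon: {str(canon).lower()}."
```

Deterministic, a few dozen tokens, and it carries the classification the model would otherwise have to infer from a thousand tokens of narrative.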

Compare that to what those tokens replaced in the chunks that overflowed:

The Iron-Banes Alliance maintains a complex hierarchical structure that has evolved significantly over the centuries, reflecting both its martial origins and its subsequent...

Twenty-four tokens of throat-clearing. It told the embedding model the chunk was about some kind of hierarchy that evolved. The prefix told it the chunk was about a specific canonical faction. The model doesn't care about your prose style. It cares about information density.

This is the lesson I keep circling back to from Part 1: in information retrieval, what you tell the system is more valuable than what you show it. RAG pipelines hand the embedding model raw text and say "figure it out." Embedding models are good at this. But inferring topic, domain, entity type, and authority from 974 tokens of narrative is harder than being told those facts outright in a 24-token structured prefix.

It's the difference between handing someone a novel and asking "what genre?" versus writing "MYSTERY" on the spine. The novel contains everything needed to answer the question. The label is faster.

What I'd Do Differently

Three things, in order of how much sleep they've cost me.

Log the overflow. I know 94 of 218 chunks overflowed. I don't know which 94. Were they systematically different from the survivors? Longer? From specific document types? Without IDs, I can't separate the enrichment signal from the overflow cleaning signal. D-23 logs everything.
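The fix is a wrapper that records skipped ids instead of dropping them silently. A sketch of what D-23's version does, in spirit; `embed_fn` and `fits_fn` stand in for the real embedding call and token-limit check:

```python
import logging

log = logging.getLogger("embedding-pipeline")

def embed_with_audit(chunks, embed_fn, fits_fn):
    """Embed (id, text) pairs, recording exactly which ids were
    skipped instead of dropping them silently."""
    kept, dropped = [], []
    for chunk_id, text in chunks:
        if fits_fn(text):
            kept.append((chunk_id, embed_fn(text)))
        else:
            dropped.append(chunk_id)
            log.warning("chunk %s overflowed the token limit; skipped", chunk_id)
    return kept, dropped
```

With the `dropped` list persisted, the post-hoc question "were the overflowed chunks systematically different?" becomes answerable.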

Reserve the token budget. Dropping 43% of chunks is not a strategy. It was an arithmetic oversight: I sized chunks for 1,024 tokens, then added prefix tokens that pushed some of them over. D-23 addresses this with a prefix_reserve mechanism that pre-allocates token budget for enrichment, guaranteeing zero overflow. (It produced zero overflow. It also produced all-zero metrics, because I introduced a different bug entirely. Research is glamorous.)
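The arithmetic behind the reserve is one subtraction: carve the prefix allowance out of the chunking target up front, so chunk plus prefix can never exceed the window. The 32-token default here is my illustrative choice of a safe ceiling for a ~24-token prefix, not a number from the actual config:

```python
def chunk_target(window=1024, prefix_reserve=32):
    """Size raw chunks so that chunk + prefix never exceeds the window."""
    if prefix_reserve >= window:
        raise ValueError("prefix reserve swallows the entire window")
    return window - prefix_reserve
```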

Start with the confound, not the result. I presented the D-22 numbers in Part 1 as a clean improvement story. They're not clean. The improvement is real, probably, but the magnitude is inflated by uncontrolled data loss. I should have led with the caveat. I'm leading with it now.

The Actual Takeaway

Twenty-four tokens of the right information outperformed 974 tokens of undifferentiated narrative. The prefix told the embedding model what the topic was. The raw text told it about the topic. That's a different kind of information, and embedding models use it better than most RAG engineers expect.

Also: check your chunk counts after indexing. Always. Learn from my 43%.
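The check itself is a few lines. A sketch assuming a ChromaDB-style collection object that exposes `count()` (the helper name is mine):

```python
def verify_index(collection, source_chunks):
    """Fail loudly if the index holds fewer chunks than the source."""
    indexed = collection.count()
    expected = len(source_chunks)
    if indexed != expected:
        raise RuntimeError(
            f"index has {indexed} chunks, source has {expected}; "
            f"{expected - indexed} dropped somewhere"
        )
```

Wire it in right after the insert step, not as an optional sanity check you run when you're suspicious. You won't be suspicious; that's the whole problem.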


This is Part 2 of the Research Notebooks series. Part 1: Context Windows Are a Lie. Next: the D-23 multi-layer experiment, where I ran a perfectly executed pipeline and every single metric was wrong.