Embedding Models Don't Read Your Metadata (But They Should)
Here's a sentence your embedding model understands perfectly well:
"This is a canonical faction document from the post-Glitch era describing cultural practices and political structure."
And here's functionally identical information that your embedding model treats as random noise:
```yaml
canon: true
domain: faction
era: post-glitch
topics: [culture, politics]
```
Same facts. Same document. Different embedding behavior. The YAML blob gets processed as a handful of disconnected key-value strings with little semantic weight. The natural-language sentence gets encoded as a rich set of contextual signals that tell the model what this document is, what it's about, and how it relates to the kinds of questions someone might ask.
The gap between the metadata your system knows and the context your embeddings encode is the single biggest free improvement sitting in most RAG pipelines. Almost nobody exploits it.
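To make the idea concrete, here's a minimal sketch of closing that gap: render structured metadata into a natural-language preamble and prepend it to the document body before embedding. The function names (`metadata_to_sentence`, `embeddable_text`) and the exact sentence template are illustrative assumptions, not a prescribed API; adapt the template to whatever fields your pipeline actually carries.

```python
# Hypothetical sketch: turn a metadata dict into a natural-language
# preamble so the embedding model sees it as context, not opaque keys.

def metadata_to_sentence(meta: dict) -> str:
    """Render metadata fields as one descriptive sentence."""
    parts = []
    parts.append("This is a canonical document" if meta.get("canon")
                 else "This is a non-canonical document")
    if "domain" in meta:
        parts.append(f"from the {meta['domain']} domain")
    if "era" in meta:
        parts.append(f"dating to the {meta['era']} era")
    if meta.get("topics"):
        parts.append("covering " + ", ".join(meta["topics"]))
    return " ".join(parts) + "."

def embeddable_text(meta: dict, body: str) -> str:
    # Prepend the sentence so the embedding encodes the metadata
    # alongside the document text itself.
    return metadata_to_sentence(meta) + "\n\n" + body

meta = {
    "canon": True,
    "domain": "faction",
    "era": "post-glitch",
    "topics": ["culture", "politics"],
}
print(metadata_to_sentence(meta))
# → This is a canonical document from the faction domain dating to
#   the post-glitch era covering culture, politics.
```

The output of `embeddable_text` is what you hand to your embedding model instead of the raw body; the YAML itself never needs to reach the encoder.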
