
Embedding Models Don't Read Your Metadata (But They Should)

[Figure: Split comparison showing YAML metadata as noise versus the same metadata as a natural-language sentence producing a sharper embedding vector]
~9 min read
Ryan Goodrich
Technical Writer, AI Enthusiast, and Developer Advocate

Here's a sentence your embedding model understands perfectly well:

"This is a canonical faction document from the post-Glitch era describing cultural practices and political structure."

And here's functionally identical information that your embedding model treats as random noise:

canon: true
domain: faction
era: post-glitch
topics: [culture, politics]

Same facts. Same document. Different embedding behavior. The YAML blob gets processed as a handful of disconnected key-value fragments with no semantic weight. The natural language sentence gets encoded as a rich set of contextual signals that tell the model what this document is, what it's about, and how it relates to the kind of questions someone might ask.

The gap between the metadata your system knows and the context your embeddings encode is the single biggest free improvement sitting in most RAG pipelines. Almost nobody exploits it.

The Problem With Metadata in Modern RAG

Every serious retrieval system has metadata. Document type, creation date, author, category, topic tags, authority level--the structural information that tells you what a document is as opposed to what it says. This metadata is valuable. Everyone agrees it's valuable. It powers filters, facets, access control, and sorting.

But here's the thing: your embedding model never sees it.

When you chunk a document, embed it, and store it in a vector store, the embedding is computed from the text content of the chunk. The metadata sits in a separate field--available for post-retrieval filtering, but invisible to the semantic similarity computation that determines what gets retrieved in the first place.

This means your retrieval system has two disconnected brains:

  1. The vector search brain knows what the document says but not what it is.
  2. The metadata filter brain knows what the document is, but it only gets consulted after the vector search has already decided what's relevant.

The vector brain retrieves candidates based on semantic similarity. The filter brain then removes candidates that don't match the metadata criteria. This is fine for simple cases--"find documents about bears, but only from the fauna category." The vector search finds bear-related content, the filter removes anything that's not fauna.

But it fails for anything nuanced, because the filter can only remove candidates. It can't boost them. It can't say "this document is not just about bears, it's a canonical bestiary entry about bears, which is exactly what a question about bear taxonomy is looking for." That kind of reasoning requires the metadata to be inside the embedding--part of the semantic representation itself, not an afterthought applied post-retrieval.
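The retrieve-then-filter pattern described above can be sketched in a few lines. All names here (the `Chunk` record, the `category` values, the scores) are illustrative, not any particular vector store's API; the point is structural: the filter runs after ranking and can only subtract.

```python
# Sketch of the two-brain pattern: the vector brain ranks by similarity,
# then the metadata filter brain removes candidates -- it can never
# boost one. Chunk, category, and scores are all illustrative.
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    score: float                      # similarity from the vector brain
    meta: dict = field(default_factory=dict)

def retrieve(candidates, category, k=2):
    # Stage 1: the vector brain has already decided the ranking.
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    # Stage 2: the filter brain subtracts; it cannot re-rank or promote.
    return [c for c in ranked if c.meta.get("category") == category][:k]

chunks = [
    Chunk("Bears hibernate in winter.", 0.91, {"category": "fauna"}),
    Chunk("Bear market strategies.", 0.88, {"category": "finance"}),
    Chunk("Canonical bestiary: bear taxonomy.", 0.60, {"category": "fauna"}),
]

results = retrieve(chunks, "fauna")
# The canonical bestiary entry survives the filter but stays ranked
# below the generic chunk -- the filter had no way to boost it.
```

Notice that the canonical bestiary entry, the best answer for a taxonomy question, stays in last place: nothing in this architecture can promote it.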

What FractalRecall Actually Does

FractalRecall's approach is simple: take the metadata, convert it to a natural language sentence, and prepend it to the chunk before embedding.

That's it. That's the intervention.

A chunk that previously looked like this to the embedding model:

"The Rune-Bear's patrol cycle operates on a 22.4-hour rhythm, following territorial boundaries marked by degraded authentication beacons. When the beacons fire, the creature pauses, waits for a response that will never come, and resumes its circuit."

Now looks like this:

"[Canonical bestiary entry from the post-Glitch era, covering URSA-class autonomous fauna in the Asgard-Midgard border region. Authority: confirmed field observation, verified by Scriptorium-Primus.] The Rune-Bear's patrol cycle operates on a 22.4-hour rhythm, following territorial boundaries marked by degraded authentication beacons. When the beacons fire, the creature pauses, waits for a response that will never come, and resumes its circuit."

Same content. But the embedding model now encodes not just what the text says but what kind of document it comes from, how authoritative it is, and what domain it belongs to.
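The transformation is mechanical enough to sketch. This is a minimal illustration, not FractalRecall's actual schema or templates: the field names, the sentence template, and the bracket convention are all assumptions for the example.

```python
# Sketch: turn structured metadata into a natural-language prefix and
# prepend it to the chunk text before embedding. Field names and the
# sentence template are illustrative, not FractalRecall's real schema.

def metadata_to_sentence(meta: dict) -> str:
    parts = []
    if meta.get("canon"):
        parts.append("Canonical")
    parts.append(f"{meta.get('domain', 'document')} entry")
    if era := meta.get("era"):
        parts.append(f"from the {era} era")
    if topics := meta.get("topics"):
        parts.append("covering " + " and ".join(topics))
    return " ".join(parts) + "."

def enrich(chunk_text: str, meta: dict) -> str:
    # The bracketed prefix becomes part of the embedding input.
    return f"[{metadata_to_sentence(meta)}] {chunk_text}"

meta = {"canon": True, "domain": "bestiary", "era": "post-Glitch",
        "topics": ["autonomous fauna", "territorial behavior"]}
enriched = enrich("The Rune-Bear's patrol cycle operates on a 22.4-hour rhythm.", meta)
```

The enriched string, not the raw chunk, is what gets handed to the embedding function; the stored chunk text and metadata fields stay untouched.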

When someone queries "What are the most dangerous autonomous creatures in the Asgard border region?", the enriched chunk is a better semantic match--not because the text content changed, but because the metadata prefix tells the embedding model that this chunk is about autonomous fauna in the Asgard region, which is exactly what the query is asking about.

The embedding model can't read YAML. But it can read English. And it turns out that's all you need.

The Numbers (Briefly)

The D-22 experiment tested this approach with a single metadata layer (domain classification). NDCG@10 (ranking quality) improved 16.5%. Recall (the fraction of relevant documents actually found) jumped 27.3%. I've written about these numbers in detail, including the uncomfortable part where 43% of chunks silently overflowed the token limit and the metrics improved anyway.
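For readers unfamiliar with the two metrics, both reduce to a few lines of arithmetic over a ranked result list. A minimal reference implementation with binary relevance (the toy document IDs are made up for the example):

```python
# NDCG@k and recall over a ranked retrieval result, binary relevance.
import math

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    # DCG: each relevant hit is discounted by its rank position.
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(ranked_ids[:k]) if doc in relevant_ids)
    # Ideal DCG: all relevant docs stacked at the top of the ranking.
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(len(relevant_ids), k)))
    return dcg / ideal if ideal else 0.0

def recall(ranked_ids, relevant_ids):
    # Fraction of all relevant documents that were retrieved at all.
    return len(set(ranked_ids) & set(relevant_ids)) / len(relevant_ids)

# Toy example: 3 relevant docs, the retriever returns 4 candidates.
retrieved = ["d1", "d7", "d3", "d9"]
relevant = {"d1", "d3", "d5"}
```

Recall rewards finding relevant documents at all; NDCG additionally rewards ranking them near the top, which is why the two numbers can move by different amounts.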

The core thesis (metadata as co-embedded context) produces strong results. The engineering challenge of doing it without exceeding token limits, without creating prefix artifacts, and without drowning the actual content in structural boilerplate is where the real work lives. That's why this is still a research project.

Why This Works (The Intuition)

Embedding models are trained on natural language. They have strong priors about what sentences mean, how topics relate, and what kind of context modifies what kind of content. When you give them a sentence like "Canonical bestiary entry from the post-Glitch era," they encode:

  • Canonical → authoritative, primary source, verified
  • Bestiary entry → creature description, fauna, biological
  • Post-Glitch era → temporal context, specific period

These encoded signals create semantic bridges between the chunk and queries that use related language. A query about "authoritative sources on creatures" now has a shorter vector distance to this chunk--not because the creature description matches, but because the prefix matches.
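The geometry of that bridge can be shown even with a deliberately crude stand-in for an embedding model. The bag-of-words "embedding" below captures none of the semantics a real dense model does, but the mechanism is the same: the prefix shares vocabulary with the query, so the enriched chunk's vector lands closer to the query vector.

```python
# Toy illustration of the "semantic bridge": a prefix that shares
# vocabulary with the query raises cosine similarity. Bag-of-words is
# a crude stand-in for a real embedding model -- mechanism, not metric.
import math
from collections import Counter

def bow_embed(text: str) -> Counter:
    return Counter(text.lower().replace(".", "").replace(",", "").split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

query = bow_embed("authoritative sources on autonomous creatures")
raw = bow_embed("The Rune-Bear's patrol cycle operates on a 22.4-hour rhythm")
enriched = bow_embed(
    "Canonical bestiary entry on autonomous fauna and creatures "
    "The Rune-Bear's patrol cycle operates on a 22.4-hour rhythm")

# The prefix shares query vocabulary; the raw chunk text barely does.
```

With a real dense model the bridge runs through learned semantics rather than literal word overlap ("canonical" pulling toward "authoritative"), which is why it works even when the surface vocabulary differs.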

This is not magic. It's leveraging something the model already knows how to do (process natural language context) and giving it context it didn't previously have access to.

It's the spine label on a library book. Without it, a librarian has to open every book and read every page to find what a patron needs. The metadata prefix doesn't change what's inside. It tells you what kind of book it is before you open it.

What This Means for Production Systems

If you're running a RAG pipeline in production, what does your embedding model actually know about your documents?

It knows the text content of each chunk. That's it. It has no concept of:

  • Document type. Is this a policy document? A product manual? A customer email?
  • Authority level. Is this the definitive source, or a draft?
  • Temporal context. Current or historical? Does it supersede something?
  • Structural position. Is this chunk a section header, a conclusion, an appendix? The embedding has no idea.

All of this information exists in your system. Some of it lives in explicit metadata fields. Some of it could be inferred from the document structure. All of it is invisible to your embeddings.

FractalRecall's thesis is that making this information visible is the highest-leverage improvement most RAG systems aren't making. Convert it to natural language, include it in the embedding input, and the model can finally use what your system already knows.

The caveats are real:

  • Token limits matter. A 200-word prefix on a 300-word chunk means you're spending 40% of your embedding capacity on metadata rather than content. This is the overflow problem D-22 discovered the hard way.
  • Prefix quality matters. "This is a document" adds nothing. "Canonical policy document approved by Legal, superseding version 3.2, covering employee termination procedures for remote workers" adds a lot. The prefix needs to be specific, concise, and genuinely informative.
  • Schema design matters. You can't encode metadata you don't have. If your documents aren't tagged with domain, authority, temporal context, and structural position, you need to build that taxonomy first. The enrichment is only as good as the metadata it draws from.
  • Evaluation matters. (See: Your RAG Pipeline Has a Check Engine Light.) You need to measure whether the enrichment actually improves retrieval for your queries against your corpus. My 16.5% on Aethelgard is not your 16.5% on your data.
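The first caveat in particular admits a cheap guard: check the token budget before prepending, and degrade gracefully instead of silently overflowing. The whitespace tokenizer and the 512-token limit below are stand-ins; a production pipeline would use the embedding model's own tokenizer and its actual context window.

```python
# Guard against the overflow caveat: only prepend the prefix when the
# combined input fits the embedding window. The whitespace tokenizer
# and the 512-token limit are stand-ins for the model's real ones.
MAX_TOKENS = 512

def count_tokens(text: str) -> int:
    return len(text.split())  # swap in the model's actual tokenizer

def safe_enrich(prefix: str, chunk: str, max_tokens: int = MAX_TOKENS) -> str:
    enriched = f"[{prefix}] {chunk}"
    if count_tokens(enriched) <= max_tokens:
        return enriched
    # Never truncate content to make room for metadata: drop the
    # prefix (and log the event) rather than silently losing chunk text.
    return chunk
```

Dropping the prefix on overflow trades enrichment for integrity on the affected chunks; the alternative, truncating the tail of the chunk, is exactly the failure mode the D-22 overflow numbers describe.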

The Bigger Claim

The future of retrieval is not better embedding models. It's better embedding inputs.

The models are already very good at processing natural language. They understand topics, relationships, authority, temporality, and context--when you give them that information in a form they can process. The bottleneck isn't the model's capability. It's the information we choose to give it.

Every document in your system has structural context that the embedding model can't access. Making that context accessible is a simple intervention with outsized impact. It doesn't require a new model, a new architecture, or new infrastructure. It requires string concatenation and a willingness to rethink what "the document" means when you hand it to an embedding function.

The metadata is already there. The model is already capable of understanding it.

The only thing missing is the sentence.


FractalRecall is an active research project exploring metadata as co-embedded context for retrieval. For experiment details, see the project page. For the story of what happens when enrichment goes wrong, see I Added Context to My Embeddings and 43% of My Data Disappeared. For the evaluation framework that keeps this research honest, see Your RAG Pipeline Has a Check Engine Light.