2 posts tagged with "Embeddings"

Vector embeddings, embedding models, and the information theory behind how meaning gets compressed into numbers.

Embedding Models Don't Read Your Metadata (But They Should)

[Image: Split comparison showing YAML metadata as noise versus the same metadata as a natural language sentence producing a sharper embedding vector]

~9 min read
Ryan Goodrich
Technical Writer, AI Enthusiast, and Developer Advocate

Here's a sentence your embedding model understands perfectly well:

"This is a canonical faction document from the post-Glitch era describing cultural practices and political structure."

And here's functionally identical information that your embedding model treats as random noise:

canon: true
domain: faction
era: post-glitch
topics: [culture, politics]

Same facts. Same document. Different embedding behavior. The YAML blob gets processed as four disconnected key-value fragments with no semantic connective tissue. The natural language sentence gets encoded as a rich set of contextual signals that tell the model what this document is, what it's about, and how it relates to the kinds of questions someone might ask.

The gap between the metadata your system knows and the context your embeddings encode is the single biggest free improvement sitting in most RAG pipelines. Almost nobody exploits it.
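One way to exploit that gap is to render the metadata your system already has into a sentence before embedding. A minimal sketch of the idea; the field names and phrasing below are illustrative, not the post's actual pipeline:

```python
def metadata_to_sentence(meta: dict) -> str:
    """Render a metadata dict as a natural-language prefix an embedding
    model can actually use. Field names and phrasing are illustrative."""
    parts = ["a canonical" if meta.get("canon") else "a non-canonical"]
    parts.append(f"{meta.get('domain', 'general')} document")
    if "era" in meta:
        parts.append(f"from the {meta['era']} era")
    if meta.get("topics"):
        parts.append("covering " + " and ".join(meta["topics"]))
    return "This is " + " ".join(parts) + "."

meta = {
    "canon": True,
    "domain": "faction",
    "era": "post-glitch",
    "topics": ["culture", "politics"],
}
print(metadata_to_sentence(meta))
# → This is a canonical faction document from the post-glitch era covering culture and politics.
```

The prefix then gets concatenated onto the chunk text before it is sent to the embedding model, so the same facts arrive as context the model was trained to read rather than as a YAML blob.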

I Added Context to My Embeddings and 43% of My Data Disappeared

[Image: Terminal showing embedding pipeline results: 218 chunks input, 124 surviving, 94 silently dropped, with retrieval metrics improving despite data loss]

~7 min read
Ryan Goodrich
Technical Writer, AI Enthusiast, and Developer Advocate

In Part 1, I mentioned the D-22 experiment almost as an aside. Twenty-four tokens of metadata prefix, 16.5% improvement in ranking quality, 27.3% recall jump. Good numbers. Clean story.

I left out the part where 43% of my data vanished.

Not "performed poorly." Not "returned lower-quality results." Vanished. Ninety-four of 218 chunks silently dropped from the index because I added one sentence of context and didn't do the arithmetic on what that sentence would cost. The embedding pipeline didn't warn me. ChromaDB didn't complain. I only noticed because I'm the kind of person who checks row counts after every insert. (This is not a personality trait. It's scar tissue.)
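The arithmetic the post alludes to is simple to automate: before embedding, check which chunks a fixed prefix would push past the model's sequence limit. A hedged sketch; the crude whitespace token estimate and the function name are mine, and in practice you would use your embedding model's own tokenizer:

```python
def audit_chunks(chunks, prefix_tokens, model_limit):
    """Flag chunks that a fixed metadata prefix would push past the
    embedding model's token limit. Token counts here are crude
    whitespace estimates; swap in your model's real tokenizer."""
    survivors, dropped = [], []
    for i, chunk in enumerate(chunks):
        if prefix_tokens + len(chunk.split()) > model_limit:
            dropped.append(i)   # would be silently truncated or rejected
        else:
            survivors.append(i)
    return survivors, dropped

chunks = ["a short chunk", " ".join(["word"] * 500)]
survivors, dropped = audit_chunks(chunks, prefix_tokens=24, model_limit=256)
print(survivors, dropped)  # → [0] [1]
```

Pairing this pre-flight check with a row-count assertion after every insert (input count versus what the index actually reports) is what surfaces the silent drops the pipeline itself never warns about.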

The results improved anyway. That's the part I need to explain.