
Embedding Models Don't Read Your Metadata (But They Should)

Split comparison showing YAML metadata as noise versus the same metadata as a natural language sentence producing a sharper embedding vector
· ~9 min read
Ryan Goodrich
Technical Writer, AI Enthusiast, and Developer Advocate

Here's a sentence your embedding model understands perfectly well:

"This is a canonical faction document from the post-Glitch era describing cultural practices and political structure."

And here's functionally identical information that your embedding model treats as random noise:

```yaml
canon: true
domain: faction
era: post-glitch
topics: [culture, politics]
```

Same facts. Same document. Different embedding behavior. The YAML blob gets processed as four disconnected key-value pairs with no semantic weight. The natural language sentence gets encoded as a rich set of contextual signals that tell the model what this document is, what it's about, and how it relates to the kind of questions someone might ask.
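The transformation is a few lines of code. A minimal sketch in Python; the function and variable names here are mine, not from any particular library:

```python
def metadata_to_sentence(meta: dict) -> str:
    """Render a metadata dict as the natural-language prefix it should have been."""
    parts = ["This is a canonical" if meta.get("canon") else "This is a"]
    parts.append(f"{meta['domain']} document")
    if era := meta.get("era"):
        parts.append(f"from the {era} era")
    if topics := meta.get("topics"):
        parts.append("covering " + " and ".join(topics))
    return " ".join(parts) + "."

meta = {"canon": True, "domain": "faction",
        "era": "post-glitch", "topics": ["culture", "politics"]}
chunk_text = "The enclave's political structure rotates leadership seasonally."

# Prepend the sentence to the chunk before embedding, not after.
embed_input = metadata_to_sentence(meta) + " " + chunk_text
# -> "This is a canonical faction document from the post-glitch era
#     covering culture and politics. The enclave's political structure..."
```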

The gap between the metadata your system knows and the context your embeddings encode is the single biggest free improvement sitting in most RAG pipelines. Almost nobody exploits it.

Your RAG Pipeline Has a Check Engine Light. You're Ignoring It.

Dashboard showing a GO/NO-GO decision framework with seven evaluation criteria for RAG pipeline quality assessment
· ~10 min read
Ryan Goodrich
Technical Writer, AI Enthusiast, and Developer Advocate

I ran a retrieval experiment that returned perfect zeros across all 36 queries, and every automated check I'd built said "statistically significant." The decision engine evaluated seven criteria, saw only two pass, and issued a NO-GO. The pipeline caught the problem. Not me: the pipeline.

Here's what scares me: most production RAG systems don't have a pipeline like that. They don't have decision criteria. They don't have rollback thresholds. They don't have a concept of "this retrieval result is wrong and we should know about it automatically." They ship a model, run some spot checks, and move on to the next sprint.
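Installing the light is not exotic engineering. Here's the shape of a decision gate, sketched in Python; the criteria names are placeholders, not my pipeline's actual seven:

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str
    passed: bool

def decide(criteria: list[Criterion]) -> str:
    """Every criterion must pass for a GO; anything less is a NO-GO."""
    failed = [c.name for c in criteria if not c.passed]
    if failed:
        return f"NO-GO ({len(failed)}/{len(criteria)} failed: {', '.join(failed)})"
    return "GO"

verdict = decide([
    Criterion("recall_above_baseline", False),
    Criterion("statistical_significance", True),
    Criterion("no_silent_index_drops", False),
    # ...the remaining criteria for your pipeline
])
```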

Your RAG pipeline has a check engine light. You just never installed it.

Five Projects, One Realization: The Document Is the Database

Five project icons forming a document-centric pipeline: publish, validate, embed, compress, manage — connected by structural metadata flows
· ~8 min read
Ryan Goodrich
Technical Writer, AI Enthusiast, and Developer Advocate

I didn't plan a portfolio. I planned a Markdown file. Then another one. Then five projects materialized around them like ice crystals on a cold window, each shaped by the same principle I didn't recognize until project number four. Apparently I need to build the same insight multiple times before I notice I keep building it.

The insight: documents are not content delivery vehicles. They are structured knowledge systems. Almost every AI tool in production today throws away the structure and keeps only the content. That's like buying a filing cabinet, dumping all the folders on the floor, and asking someone to find last quarter's tax return by feeling the texture of the paper.

I know this because I've now built five projects that all, in their own way, try to fix that mistake.

I Added Context to My Embeddings and 43% of My Data Disappeared

Terminal showing embedding pipeline results: 218 chunks input, 124 surviving, 94 silently dropped, with retrieval metrics improving despite data loss
· ~7 min read
Ryan Goodrich
Technical Writer, AI Enthusiast, and Developer Advocate

In Part 1, I mentioned the D-22 experiment almost as an aside. Twenty-four tokens of metadata prefix, 16.5% improvement in ranking quality, 27.3% recall jump. Good numbers. Clean story.

I left out the part where 43% of my data vanished.

Not "performed poorly." Not "returned lower-quality results." Vanished. Ninety-four of 218 chunks silently dropped from the index because I added one sentence of context and didn't do the arithmetic on what that sentence would cost. The embedding pipeline didn't warn me. ChromaDB didn't complain. I only noticed because I'm the kind of person who checks row counts after every insert. (This is not a personality trait. It's scar tissue.)

The results improved anyway. That's the part I need to explain.

Google Said No to llms.txt. Five Google Teams Didn't Get the Memo.

Timeline showing Google executives dismissing llms.txt in April, July, and December 2025, while five Google developer documentation properties quietly implement llms.txt files in 2026.
· ~10 min read
Ryan Goodrich
Technical Writer, AI Enthusiast, and Developer Advocate

The timeline is where the joke lives.

April 2025. Google's John Mueller compares llms.txt to the keywords meta tag. For the uninitiated, the keywords meta tag is so discredited that invoking it in SEO circles is equivalent to recommending bloodletting at a medical conference. Mueller's message was clear: llms.txt is unnecessary, self-reported data that Google has no intention of using.

July 2025. Gary Illyes, also from Google's Search team, confirms the position at Search Central Live. No support. Won't be used. Normal SEO works fine for AI Overviews. The standard is, officially, not something Google is interested in.

December 3, 2025. An SEO professional named Lidia Infante discovers an llms.txt file on Google's own Search Central documentation. Mueller's response, posted to Bluesky: "hmmn :-/". The file was removed within hours.

So far, a clean narrative. Google said no, someone at Google accidentally deployed one, it was caught and deleted, and the official position holds. Embarrassing, but coherent.

Then I started pulling at threads.

Context Windows Are a Lie (And Haiku Protocol Is My Coping Mechanism)

Terminal showing a 128K context window shrinking to an effective 8K zone, with lost-in-the-middle degradation visualized as fading text
· ~10 min read
Ryan Goodrich
Technical Writer, AI Enthusiast, and Developer Advocate

LLM vendors would like you to know that their latest model supports a 128,000-token context window. Some of them say 200,000. One of them, and I won't name names but their logo is a little sunset, says a million. A million tokens. That's approximately four copies of War and Peace, which is appropriate because trying to get useful work done at the far end of a million-token window is its own kind of Russian tragedy.

Here's what the marketing materials don't mention: the effective context window, the portion where the model actually pays reliable attention to what you put there, is dramatically smaller. Research from Stanford, Berkeley, and others has converged on a finding that would be funny if it weren't costing people real money: models struggle with information placed in the middle of long contexts. They're great at the beginning. They're decent at the end. The middle? The middle is where facts go to die quietly, unnoticed, like a footnote in a terms of service agreement.

This is the "Lost in the Middle" problem, and if you're building anything that retrieves information and feeds it to a language model (which, in 2026, is approximately everyone), it means the number on the tin is a fantasy. Your 128K window is functionally an 8K window with 120K tokens of expensive padding.
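There's a well-known mitigation: reorder your retrieved chunks so the strongest land at the edges of the prompt, where attention is reliable, and the weakest get buried in the middle. A sketch of the general technique (not a claim about any specific pipeline), assuming chunks arrive sorted best-first:

```python
def reorder_for_middle_loss(ranked: list[str]) -> list[str]:
    """Rank 1 goes first, rank 2 last, rank 3 second, and so on inward."""
    front, back = [], []
    for i, chunk in enumerate(ranked):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

print(reorder_for_middle_loss(["r1", "r2", "r3", "r4", "r5"]))
# ['r1', 'r3', 'r5', 'r4', 'r2'] -> the weakest chunk lands in the middle
```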

I know this because I ran the experiment. Accidentally. Three times.

78.8% of My Validator Is Made Up (And That's the Point)

Terminal running a self-audit of DocStratum's 52 validation items: bar charts show 6 spec-compliant (11.5%), 5 spec-implied (9.6%), and 41 DocStratum extensions (78.8%). Verdict: 78.8% invented — that's the product.
· ~16 min read
Ryan Goodrich
Technical Writer, AI Enthusiast, and Developer Advocate

I recently did something that most software developers would consider either admirably honest or clinically inadvisable: I audited my own tool against the specification it claims to implement, wrote down the results in excruciating detail, and published them.

The tool is DocStratum, a documentation quality platform for llms.txt files. The project started with a thesis that most people in the AI tooling space either haven't considered or don't want to hear: a Technical Writer with strong Information Architecture skills can outperform a sophisticated RAG pipeline by simply writing better source material. Structure is a feature. DocStratum exists to prove it.

At its core, DocStratum is a validation framework — think ESLint, but for a Markdown standard defined by a blog post instead of a formal grammar. It checks your llms.txt file across five validation levels: basic parseability (L0), structural compliance (L1), content quality (L2), best practices (L3), and a full extended-quality tier (L4). It categorizes findings across 38 diagnostic codes using three severity levels (Error, Warning, Info). It detects anti-patterns — 22 of them, with names like "The Ghost File," "The Monolith Monster," and "The Preference Trap." It has opinions.
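For a sense of the shape (not DocStratum's actual internals, which may differ), a finding looks roughly like this:

```python
from dataclasses import dataclass
from enum import Enum

class Level(Enum):
    L0 = "basic parseability"
    L1 = "structural compliance"
    L2 = "content quality"
    L3 = "best practices"
    L4 = "extended quality"

class Severity(Enum):
    ERROR = "error"
    WARNING = "warning"
    INFO = "info"

@dataclass
class Finding:
    code: str          # one of the 38 diagnostic codes
    level: Level
    severity: Severity
    message: str
    line: int | None = None

# Hypothetical code identifier; the real tool defines its own scheme.
ghost = Finding("ghost-file", Level.L3, Severity.WARNING,
                "The Ghost File: linked document does not exist")
```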

Those opinions, it turns out, are almost entirely my own invention. (Good.)

The Three Voices of Technical Research: Why My Blog Sounds Nothing Like My Paper

Three terminal panes side by side showing the same WAF-blocking finding in three voices: the blog (opinionated, orange tab), the guide (neutral, green tab), and the paper (impartial, blue tab). Tagline: same research, three rooms.
· ~10 min read
Ryan Goodrich
Technical Writer, AI Enthusiast, and Developer Advocate

Someone recently asked me a question that I've been thinking about ever since: "Doesn't writing your blog posts with humor and sarcasm undermine your credibility as a researcher?"

It's a fair question. The blog posts on this site are... aggressively me. I compare WAF blocking to "hiring a security guard who prevents anyone matching the physical description of 'reads books' from entering the bookstore." I describe AI crawlers as looking like "a DDoS attack with a liberal arts degree." I write sentences like "I am a documentation-first developer with a research compulsion and a growing collection of Markdown files about Markdown files," and then I publish those sentences on the internet where potential collaborators can see them.

Meanwhile, the analytical paper I'm writing about the same research uses phrases like "the structural misalignment between content publication intent and infrastructure-level access enforcement." Which is the same observation as the bookstore metaphor, expressed in the register of someone who wants to be taken seriously at a conference.

Same research. Same data. Same conclusions. Radically different voices. And I'd argue that if I used only one of those voices everywhere, the whole project would be worse.

I Fact-Checked My Own Research Paper Before Writing It (You Should Too)

Terminal running an evidence inventory audit of 49 claims: 33 verified, 13 author analysis, 1 partial, and 2 incorrect — including the 844,000 adoption stat that collapsed to 784 directory entries and 105 in the top million.
· ~11 min read
Ryan Goodrich
Technical Writer, AI Enthusiast, and Developer Advocate

Here's a workflow tip that's either going to save your credibility or confirm that I have an unhealthy relationship with spreadsheets: before you write anything that makes factual claims, build an evidence inventory first.

Not a bibliography. Not a "sources" section at the bottom of a Google Doc. An actual structured inventory where every single factual claim in your paper, blog post, report, or conference talk is cataloged, mapped to a primary source, independently verified, and assigned a status. Verified. Partially verified. Unverified. Or the one that makes your stomach drop: incorrect.
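Concretely, one row of that inventory can be as simple as this sketch; the field names are mine, and the statuses are the four from above:

```python
from dataclasses import dataclass
from enum import Enum

class Status(Enum):
    VERIFIED = "verified"
    PARTIAL = "partially verified"
    UNVERIFIED = "unverified"
    INCORRECT = "incorrect"

@dataclass
class Claim:
    claim_id: str
    text: str              # the factual claim, quoted exactly
    primary_source: str    # a primary source, not a secondhand mention
    status: Status
    notes: str = ""

row = Claim(
    "C-031",
    "844,000 websites have adopted llms.txt",
    "https://example.com/llms-txt-directory",  # placeholder URL
    Status.INCORRECT,
    "Directory lists 784 entries; 105 in the top million sites.",
)
```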

I know this sounds like the kind of advice that belongs on a poster in a university writing center, sandwiched between "cite your sources" and "plagiarism is bad." But I'm not talking about academic hygiene. I'm talking about self-defense.

The 844,000 Sites That Weren't: How an AI Adoption Stat Fell Apart Under Scrutiny

· ~10 min read
Ryan Goodrich
Technical Writer, AI Enthusiast, and Developer Advocate

I need to tell you about a number. It's a number that shows up in blog posts and LinkedIn threads and conference talks and those AI trend reports that get passed around Slack channels like contraband. The number is 844,000, and it refers to the number of websites that have supposedly adopted the llms.txt standard.

I encountered this number while building the evidence inventory for an analytical paper about llms.txt (the Markdown-based content discovery format proposed by Jeremy Howard in September 2024). Because I am the kind of person who builds evidence inventories before writing papers, the kind of person who catalogs every factual claim and traces it back to a primary source before committing a single sentence to a draft, I decided to verify it.

I should not have done this on a weeknight. The verification process involved what I can only describe as the five stages of grief, but for statistics.