
DocStratum

DocStratum is a validation tool for llms.txt files. It checks whether a file conforms to the specification, reports compliance issues with actionable details, and categorizes errors by severity. The name comes from geological stratigraphy: DocStratum examines the layers of a document to see if they're structurally sound.

The Problem It Solves

The llms.txt specification was published as a blog post by Jeremy Howard, not as an RFC or formal grammar. It leaves ambiguities, edge cases, and interpretation questions without definitive answers. When hundreds of sites adopt a standard that informal--with notable names like Anthropic and Cloudflare alongside smaller implementers--compliance varies wildly.

A geological core sample of an llms.txt file — each layer validated independently.

DocStratum brings structure to that ambiguity. It encodes one specific, documented interpretation of the spec and validates against it consistently. Where the spec is ambiguous, DocStratum makes a judgment call and documents it. The goal isn't to be the "official" validator; it's to be a consistent, explainable one.

How It Differs from LlmsTxtKit's Validation

LlmsTxtKit includes validation as one capability in a larger pipeline (parse → fetch → validate → cache → generate context). DocStratum is focused and standalone, with deeper analysis:

  • LlmsTxtKit validation answers: "Is this file valid enough to use?" A gate in a pipeline.
  • DocStratum validation answers: "How compliant is this file, what exactly is wrong, and how should it be fixed?" An analysis tool.

Think of LlmsTxtKit as a compiler's syntax check--enough to proceed. DocStratum is the linter: opinionated, detailed, designed to improve quality over time.

What It Checks

DocStratum validates across several categories:

A compliance heatmap showing validation results across multiple sites and rule categories.

  • Structure: does the file have the required sections? Is the Markdown well-formed? Are headers at the correct levels?
  • Links: are URLs well-formed, HTTPS, and non-duplicate, each with a description?
  • Metadata: is the title present and non-empty? Are descriptions meaningful, not just the URL repeated?
  • Spec compliance: does the file follow the explicit rules and avoid the discouraged patterns? Where the spec is silent, DocStratum applies documented defaults.
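To make the flavor of these checks concrete, here is a minimal sketch in Python. It is not DocStratum's actual code or API; the function name, the link pattern, and the issue wording are all hypothetical, and it covers only a few of the rules above:

```python
import re
from urllib.parse import urlparse

# Hypothetical sketch of a few checks in the spirit of DocStratum's rules.
# Matches llms.txt-style link lines: "- [title](url): description"
LINK_RE = re.compile(r"^- \[(?P<title>[^\]]+)\]\((?P<url>[^)]+)\)(?::\s*(?P<desc>.+))?$")

def validate(text: str) -> list[str]:
    """Return a list of human-readable issues found in an llms.txt document."""
    issues = []
    lines = text.splitlines()

    # Structure: an H1 title must be present and non-empty.
    h1s = [l for l in lines if l.startswith("# ")]
    if not h1s:
        issues.append("missing H1 title")
    elif not h1s[0][2:].strip():
        issues.append("H1 title is empty")

    # Links: well-formed, HTTPS, non-duplicate, each with a real description.
    seen = set()
    for n, line in enumerate(lines, 1):
        m = LINK_RE.match(line.strip())
        if not m:
            continue
        url = m.group("url")
        if urlparse(url).scheme != "https":
            issues.append(f"line {n}: non-HTTPS link {url}")
        if url in seen:
            issues.append(f"line {n}: duplicate link {url}")
        seen.add(url)
        desc = m.group("desc")
        if desc is None:
            issues.append(f"line {n}: link has no description")
        elif desc.strip() == url:
            issues.append(f"line {n}: description merely repeats the URL")
    return issues
```

Each issue string carries a line number and a reason, which is the "actionable details" half of the job; a real validator would also attach a severity to each finding.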

Planned: Corpus Scanning

One of the planned blog posts involves running DocStratum against a corpus of 100+ real-world llms.txt files and reporting the results. What percentage are fully compliant? What are the most common errors? Are there patterns by industry or stack? The tooling for corpus scanning is part of DocStratum's roadmap.
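A corpus scan of that shape could be scripted along these lines. This is a sketch, not DocStratum's tooling: `scan_corpus` is a hypothetical name, and `validate` stands in for whatever checker returns a list of issue strings per file:

```python
from collections import Counter
from typing import Callable

# Hypothetical corpus-scan sketch: run a validator over many llms.txt files
# and tally which kinds of issues come up most often.
def scan_corpus(files: dict[str, str],
                validate: Callable[[str], list[str]]) -> tuple[float, Counter]:
    """Return (fraction of fully compliant files, counts of issue kinds)."""
    tallies = Counter()
    compliant = 0
    for name, text in files.items():
        issues = validate(text)
        if not issues:
            compliant += 1
        # Bucket issues by their kind: the text before any ":" detail,
        # so "line 4: non-HTTPS link ..." and "line 9: non-HTTPS link ..."
        # count toward the same category.
        tallies.update(i.split(":", 1)[-1].strip() if i.startswith("line")
                       else i for i in issues)
    return compliant / len(files), tallies
```

From the returned counter, "what percentage are fully compliant" and "what are the most common errors" fall out directly; the by-industry breakdown would just mean grouping the input files before calling it.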

Editorial annotations on a document — the kind of markup DocStratum's validation produces.

Where to Find It