How many websites have actually adopted llms.txt?

As of early 2026, the best available data shows roughly 784 sites listed in the llmstxt.site directory, with only 105 sites found in the Majestic Million (a list of the top 1 million websites, ranked by referring subnets rather than by traffic). The widely cited '844,000+' figure has no verifiable primary source and appears to be a misinterpretation or aggregation error.

Where did the 844,000 llms.txt adoption number come from?

The origin of the '844,000+' figure could not be traced to a credible primary source during independent verification. Community directories list hundreds to low thousands of entries, and independent crawls of top websites show adoption rates below 0.02%. The number appears to have circulated through secondary reporting without a verifiable methodology behind it.

What does actual llms.txt adoption look like?

Adoption is real but narrow. Community directories list several hundred implementations, mostly concentrated in developer documentation and tech companies. Chris Green's Majestic Million analysis found 105 sites out of the top million (0.011%), with zero adoption among the 1,000 highest-ranked sites on that list.
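To make the percentage concrete, here is a minimal sketch of the arithmetic behind the 0.011% figure. The two constants are the numbers quoted above; the function name is just an illustrative helper.

```python
# Figures quoted in the text: 105 sites with llms.txt out of the top million.
SITES_WITH_LLMS_TXT = 105
SITES_SAMPLED = 1_000_000


def adoption_rate_percent(adopters: int, sample_size: int) -> float:
    """Return adopters as a percentage of the sample."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return 100.0 * adopters / sample_size


rate = adoption_rate_percent(SITES_WITH_LLMS_TXT, SITES_SAMPLED)
sites_per_adopter = SITES_SAMPLED / SITES_WITH_LLMS_TXT

print(f"{rate:.4f}%")              # 0.0105%
print(f"{sites_per_adopter:.0f}")  # 9524
```

In other words, about one site in every 9,500 on the list publishes an llms.txt file, which the text rounds to 0.011%.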

How widely adopted is llms.txt?

Community directories list hundreds of implementations, but cross-referencing against the Majestic Million shows only 105 sites (0.011% of the top million) have llms.txt files, with zero in the top 1,000. Adoption is concentrated almost entirely in developer documentation and tech companies.

Do AI systems use llms.txt at inference time?

No major AI provider has publicly confirmed using llms.txt at inference time. Google explicitly rejected the standard, and server log evidence is more consistent with training-time data collection than real-time retrieval. The spec was designed for inference, but the data doesn't show that happening.

Why do Cloudflare's AI crawler settings conflict with each other?

Cloudflare has three overlapping control layers: the AI Audit dashboard, AI Crawl Control categories, and WAF Custom Rules. WAF custom rules execute before AI Crawl Control settings, meaning a security rule can override Cloudflare's own AI-specific toggles without the site operator realizing it.
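The ordering problem is easier to see as a toy model. The sketch below is not Cloudflare's API or rule syntax; the two layer functions and their matching logic are hypothetical stand-ins. It only illustrates the point made above: when layers run in sequence and the first decisive answer wins, an earlier security rule can silently override a later AI-specific "allow."

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Request:
    user_agent: str
    path: str


# A layer returns "block", "allow", or None (no opinion, fall through).
Layer = Callable[[Request], Optional[str]]


def waf_custom_rule(req: Request) -> Optional[str]:
    # Hypothetical security rule: block anything without a browser-like UA.
    if "Mozilla" not in req.user_agent:
        return "block"
    return None


def ai_crawl_control(req: Request) -> Optional[str]:
    # Hypothetical AI toggle: the operator has chosen to allow GPTBot.
    if "GPTBot" in req.user_agent:
        return "allow"
    return None


# Order matters: per the text, WAF custom rules run before AI Crawl Control.
PIPELINE: List[Tuple[str, Layer]] = [
    ("waf_custom_rule", waf_custom_rule),
    ("ai_crawl_control", ai_crawl_control),
]


def evaluate(req: Request, layers: List[Tuple[str, Layer]]) -> Tuple[str, str]:
    """Return (decision, name of the layer that made it)."""
    for name, layer in layers:
        decision = layer(req)
        if decision is not None:
            return decision, name
    return "allow", "default"


result = evaluate(Request("GPTBot/1.0", "/llms.txt"), PIPELINE)
print(result)  # ('block', 'waf_custom_rule')
```

The AI toggle says "allow," but it never gets a vote. Reversing the order of `PIPELINE` in this toy model flips the outcome, which is why the evaluation order matters more than any single setting.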

Is the llms.txt Access Paradox just a configuration error?

No, it's a structural problem. The llms.txt standard assumes frictionless access between 'file published' and 'file consumed by AI.' But Cloudflare alone sits in front of roughly 20% of all websites, and all major WAF providers treat AI crawlers as threats by default. The burden falls entirely on site operators to become WAF experts.

What is the llms.txt Access Paradox?

The llms.txt Access Paradox is the structural conflict where a site publishes an llms.txt file for AI systems, but the site's own Web Application Firewall (WAF) blocks every AI crawler from reading it, because AI crawlers are indistinguishable from malicious bots to bot-detection systems.

Why do WAFs block AI crawlers?

AI crawlers don't execute JavaScript, don't maintain cookies, originate from data center IPs, use non-browser user agents like GPTBot/1.0, and make single stateless GET requests. Each signal matches malicious bot behavior, so a WAF has no reliable way to tell them apart from genuine threats.
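As a rough illustration of why those signals collide, here is a toy scoring sketch. It is not how any particular WAF works; the `ClientProfile` fields simply mirror the five signals listed above, and the one-point-per-signal scoring is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClientProfile:
    executes_javascript: bool
    keeps_cookies: bool
    datacenter_ip: bool
    browser_user_agent: bool
    stateless_single_get: bool


def suspicion_score(p: ClientProfile) -> int:
    """Count how many of the five bot-like signals the client exhibits."""
    return sum([
        not p.executes_javascript,
        not p.keeps_cookies,
        p.datacenter_ip,
        not p.browser_user_agent,
        p.stateless_single_get,
    ])


# Hypothetical profiles built from the signals described in the text.
ai_crawler = ClientProfile(False, False, True, False, True)
malicious_scanner = ClientProfile(False, False, True, False, True)
human_browser = ClientProfile(True, True, False, True, False)

print(suspicion_score(ai_crawler))         # 5
print(suspicion_score(malicious_scanner))  # 5
print(suspicion_score(human_browser))      # 0
print(ai_crawler == malicious_scanner)     # True
```

On these signals alone, the well-behaved crawler and the scanner produce the same profile, so any threshold that stops one stops the other.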

How can I test if my WAF blocks AI crawlers from reading my llms.txt?

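Run a curl command with an AI crawler user agent: curl -si -H 'User-Agent: GPTBot/1.0' https://yoursite.com/llms.txt. The lowercase -i flag sends a normal GET request and prints the response headers followed by the body (uppercase -I would send a HEAD request and show headers only, so you would never see the content). If you get a 403 or a JavaScript challenge page instead of your Markdown content, your WAF is blocking AI access.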
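If you want to repeat that check for several user agents at once, a small script can do the same thing. This is a minimal sketch using only the Python standard library. The helper names, the list of user agents (including the "ClaudeBot/1.0" version string and the bare "Mozilla/5.0" control), and the "starts with HTML" heuristic for spotting a challenge page are all assumptions for illustration, not an official detection method.

```python
import urllib.error
import urllib.request
from typing import Dict, Sequence, Tuple

# Assumed user agents: GPTBot/1.0 comes from the text; the others are
# illustrative. The browser-like entry acts as a control.
DEFAULT_AGENTS = ("GPTBot/1.0", "ClaudeBot/1.0", "Mozilla/5.0")


def classify(status: int, body: str) -> str:
    """Label a response as 'blocked', 'error', 'challenge-or-html', or 'ok'."""
    if status == 403:
        return "blocked"
    if status >= 400:
        return "error"
    # llms.txt should be Markdown; an HTML document here suggests a
    # challenge or interstitial page was served instead (heuristic).
    head = body.lstrip().lower()[:200]
    if head.startswith(("<!doctype html", "<html")):
        return "challenge-or-html"
    return "ok"


def fetch(url: str, user_agent: str, timeout: float = 10.0) -> Tuple[int, str]:
    """GET the URL with the given User-Agent; return (status, start of body)."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.read(4096).decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses raise HTTPError; the body is still readable.
        return err.code, err.read(4096).decode("utf-8", errors="replace")


def probe(url: str, agents: Sequence[str] = DEFAULT_AGENTS) -> Dict[str, str]:
    """Fetch the URL once per user agent and classify each response."""
    return {agent: classify(*fetch(url, agent)) for agent in agents}
```

Calling `probe("https://yoursite.com/llms.txt")` returns one label per user agent. If the browser-like control comes back "ok" while the crawler agents come back "blocked" or "challenge-or-html", the difference is almost certainly coming from the WAF rather than from the file itself.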

Do site operators know when their WAF blocks AI crawlers?

No. WAFs don't send alerts when they block ClaudeBot or GPTBot. The llms.txt file looks fine in a browser. The only way to discover the block is to test with AI-representative user agents, which most site operators never do.

Blog | southpawriter

The 844,000 Sites That Weren't: How an AI Adoption Stat Fell Apart Under Scrutiny

February 17, 2026 · ~10 min read

Ryan Goodrich

Technical Writer, AI Enthusiast, and Developer Advocate

I need to tell you about a number. It's a number that shows up in blog posts and LinkedIn threads and conference talks and those AI trend reports that get passed around Slack channels like contraband. The number is 844,000, and it refers to the number of websites that have supposedly adopted the llms.txt standard.

I encountered this number while building the evidence inventory for an analytical paper about llms.txt (the Markdown-based content discovery format proposed by Jeremy Howard in September 2024). Because I am the kind of person who builds evidence inventories before writing papers, the kind of person who catalogs every factual claim and traces it back to a primary source before committing a single sentence to a draft, I decided to verify it.

I should not have done this on a weeknight. The verification process involved what I can only describe as the five stages of grief, but for statistics.

The llms.txt Access Paradox: The Data Nobody Wants to Hear

February 16, 2026 · ~14 min read

Ryan Goodrich

Technical Writer, AI Enthusiast, and Developer Advocate

In Part 1, I told the story of discovering that my own hosting infrastructure was blocking AI crawlers from reading the llms.txt file I'd specifically published for them. A Web Application Firewall (WAF), the security layer that inspects every inbound HTTP request, can't tell the difference between "AI system reading curated content as intended" and "malicious bot probing endpoints for vulnerabilities," and the result is a paradox that would be hilarious if it weren't also my actual production environment.

That was the personal version, the "I discovered this at 11 PM and said words I can't publish on a professional blog" version. This is the systemic version. The one where I pull at the thread and the whole sweater starts to unravel.

Because once I started asking "how widespread is this?", the answers didn't just confirm the WAF problem. They complicated the entire premise of what llms.txt is supposed to do. And I mean the entire premise.

I Tried to Help AI Read My Website. My Own Firewall Said No.

February 15, 2026 · ~11 min read

Ryan Goodrich

Technical Writer, AI Enthusiast, and Developer Advocate

I did everything right. I wrote the file. I followed the spec. I deployed it to production. I even tested it in my browser: clean Markdown rendering, proper H2 sections, curated links with useful descriptions. My llms.txt file was, and I say this without hyperbole, the best piece of structured content I had ever placed at a root URL. I was proud of that file, in the way that only a documentation-first developer can be proud of a Markdown file that nobody has read.

Then an AI system tried to read it, and my own infrastructure said no.

Not a polite "no, sorry, you don't have permission." Not even a helpful "no, that file doesn't exist." The kind of no where Cloudflare intercepts the request before it touches my server, decides the visitor looks suspicious on the basis of (and I love this) being exactly the kind of visitor the file was created for, and serves a JavaScript challenge page instead. To the AI crawler, my lovingly curated Markdown might as well not exist. In its place: a blob of obfuscated HTML designed to prove the visitor is human. Which, by definition, the AI crawler is not. Nor does it aspire to be. That's the entire point.

Welcome to what I've started calling the llms.txt Access Paradox: the structural conflict between publishing content for AI systems and running the security infrastructure that blocks them. It's the kind of problem that makes you close your laptop, open it again, and start writing a research paper instead of just a blog post.

I Write the Docs Before the Code, and Yes, I Know That's Weird

February 13, 2026 · ~10 min read

Ryan Goodrich

Technical Writer, AI Enthusiast, and Developer Advocate

I have a confession to make. When I start a new project, any project, doesn't matter what it is, the first thing I do is open a Markdown file and start writing documentation for something that doesn't exist yet.

Not code. Not a prototype. Not even a to-do list. Documentation.

I realize this makes me sound like the kind of person who reads the terms of service before clicking "I Agree." I promise I'm not. (I absolutely am.)