Why AI Crawlers Get Blocked (and What You Can Do About It)
You've done everything right. You've created an llms.txt file. You've written useful descriptions for every link. You've deployed it to production. You've even told people about it on social media, which took a level of confidence you usually reserve for parallel parking in front of an audience.
And then an AI system tries to read it, and your own infrastructure says no.
Welcome to the WAF paradox: the tragicomic collision between "please read my curated content" and "all non-human visitors shall be treated as hostile entities." This guide explains what's happening, why it's happening, how to diagnose it, and what you can actually do about it.
Who This Guide Is For
This guide is written for three audiences:
- Site operators who have published (or want to publish) an llms.txt file and need to ensure AI systems can actually access it
- Developers building tools that fetch llms.txt files or other structured content from the web (you need to understand the landscape of what's blocking you)
- Anyone curious about the infrastructure layer that sits between "the open web" and "AI systems that read the open web," because that layer is thicker, weirder, and more consequential than most people realize
Prerequisites
- Basic understanding of HTTP requests and responses (status codes, headers)
- Familiarity with what an llms.txt file is (the glossary entry covers the essentials)
- Access to a terminal with `curl` (optional but extremely useful for diagnostics)
No security expertise required. This guide assumes you know what a 403 status code means but not necessarily what a JavaScript challenge page is. That's fine. We'll get there.
Part 1: What WAFs Are (and Why They're Not Going Away)
The Security Layer You Forgot Was There
A Web Application Firewall sits between the public internet and your web server. Every HTTP request that arrives at your site passes through the WAF first. The WAF inspects it (the headers, the IP address, the request pattern, the user agent, possibly a JavaScript challenge result) and decides: legitimate visitor, or threat?
This isn't optional security theater. WAFs block real attacks: SQL injection, cross-site scripting, credential stuffing, DDoS floods, and the daily background radiation of bots probing every exposed endpoint on the internet. The web is a hostile environment, and WAFs are the bouncers.
The problem is that bouncers can't read invitations.
The Major Players
The WAF market is dominated by a handful of providers. The specifics of how each one handles AI crawlers vary, but the underlying detection logic is similar:
Cloudflare. Sits in front of roughly 20% of all public websites. In July 2025, they launched what they called an "AI Audit" dashboard, giving site owners one-click controls to block AI bots. On new domains, AI bot blocking became the default. They called the rollout "AIndependence Day," which, look, I respect the commitment to a pun. They also publish a continuously updated list of known AI bot signatures that their WAF can automatically intercept.
Akamai. Runs their Bot Manager product, which uses behavioral analysis, browser fingerprinting, and IP reputation scoring. Akamai's approach focuses heavily on whether the client behaves like a browser, not just whether it claims to be one. AI crawlers, which typically fire a single GET request and vanish, look nothing like a human browsing session. Akamai notices.
AWS WAF (+ Shield + CloudFront). Amazon's layered defense stack. AWS WAF supports managed rule sets, including bot-control rules, that can be applied to CloudFront distributions and API Gateway endpoints. The bot-control rules categorize traffic by signal analysis: known bot signatures, IP reputation, request tokens. AI crawlers typically fall into the "automated traffic" bucket, which gets the same treatment as scraping bots unless explicitly allowlisted.
Fastly. Uses their "Next-Gen WAF" (built on the Signal Sciences acquisition) with a signal-scoring approach. Requests accumulate signals from different detectors (IP reputation, request anomalies, user agent classification) and a threshold determines whether to block. AI crawlers tend to accumulate enough signals to trip the threshold even when each individual signal wouldn't be sufficient alone.
Sucuri, Imperva (Incapsula), Vercel. All operate WAF products with bot-detection capabilities that can affect AI crawler access. Vercel is worth noting specifically because many Docusaurus and Next.js sites are deployed there.
How WAFs Detect Bots
Bot detection isn't a single check. It's a layered system, and AI crawlers fail almost every layer:
| Detection Method | What It Checks | AI Crawlers Fail Because... |
|---|---|---|
| User Agent Analysis | Does the request identify as a known browser? | AI crawlers honestly identify as GPTBot, ClaudeBot, anthropic-ai, Google-Extended, etc. Honesty is punished. |
| JavaScript Challenges | Can the client execute JS and return a valid response? | AI crawlers don't run JavaScript. They make raw HTTP requests. Challenge pages return a CAPTCHA or blank response. |
| TLS Fingerprinting | Does the TLS handshake look like a real browser? | HTTP client libraries (like HttpClient in .NET or requests in Python) produce fingerprints distinct from Chrome, Firefox, or Safari. |
| IP Reputation | Does the request come from a data center IP range? | AI crawlers run on cloud infrastructure. Data center IPs are inherently suspicious because few real humans browse the web from AWS us-east-1. |
| Request Patterns | Does the client behave like a human browsing session? | A single GET request to /llms.txt with no cookies, no referer, no session context. That's not a browsing session; that's a surgical strike. |
| Cookie/Session Analysis | Does the client maintain state across requests? | AI crawlers typically don't. Stateless request, instant red flag. |
Each of these signals individually might not trigger a block. Together, they're a neon sign that reads "I AM NOT A HUMAN AND I'M NOT EVEN TRYING TO PRETEND."
Part 2: The Paradox (When Your Security Blocks Your Content Strategy)
The llms.txt Access Problem
Here's the sequence of events that produces the paradox:
1. A site operator creates an llms.txt file to help AI systems understand their content
2. They deploy it behind a WAF (because of course they do; they're not running a production site without security)
3. An AI system, exactly the kind of system the file was created for, tries to fetch `/llms.txt`
4. The WAF sees a non-browser user agent from a data center IP making a stateless HTTP GET with no JavaScript capability
5. The WAF blocks the request, returns a 403, or serves a JavaScript challenge page
6. The AI system receives either an error or a blob of challenge HTML instead of structured Markdown
7. The AI system falls back to whatever information it can get from search APIs, cached content, or its training data
8. The site operator's carefully curated content goes unread by the systems it was specifically written for
Everyone involved is doing exactly what they're supposed to do. The system still doesn't work.
It Gets Worse: The Invisible Failure
The truly insidious part is that this failure is often invisible to the site operator. Most WAFs don't send an email saying "Hey, we blocked ClaudeBot from reading your llms.txt file." The block shows up in analytics dashboards (if you know where to look), but it doesn't generate an alert. Your llms.txt file is technically online. It's just functionally inaccessible to its entire target audience.
It's like publishing a book and then hiring a security guard to prevent anyone matching the physical description of "reads books" from entering the bookstore.
The robots.txt Complication
This problem interacts with robots.txt in ways that make diagnosis even harder:
- If your robots.txt blocks AI crawlers and your WAF blocks them: the robots.txt takes precedence semantically (the crawler should respect robots.txt before attempting the fetch), but the WAF enforces the block mechanically. You might think you've solved the problem by updating robots.txt, but the WAF is still blocking requests from crawlers that don't check robots.txt first, or that check it, are allowed, and then get blocked anyway on the actual content request.
- If your robots.txt allows AI crawlers but your WAF blocks them: this is the paradox in its purest form. Your policy says "yes" and your infrastructure says "no."
Part 3: Diagnosing the Problem
Step 1: Check If Your llms.txt Is Accessible from Outside
The simplest test. Open a terminal and make a request that mimics how an AI crawler would fetch your file:
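A minimal sketch with curl, assuming a placeholder domain. The user agent strings are simplified stand-ins (real crawler UAs are longer; check each vendor's documentation), but WAF rules typically match on the bot token, so they're enough to trigger selective blocking:

```shell
# Same URL, three identities. Replace yoursite.example with your domain.
URL="https://yoursite.example/llms.txt"

for ua in \
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36" \
  "GPTBot/1.0 (+https://openai.com/gptbot)" \
  "ClaudeBot/1.0 (+claudebot@anthropic.com)"; do
  # -o /dev/null discards the body; -w prints only the status code.
  code=$(curl -s -o /dev/null -w "%{http_code}" -A "$ua" "$URL" || true)
  printf 'HTTP %s  <- %.20s...\n' "$code" "$ua"
done
```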
What to look for:
- Status code 200 for all three: Congratulations, your site isn't blocking AI crawlers. You may stop reading here. (You won't, though. You're curious. I respect that.)
- Status code 403 for AI user agents but 200 for the browser: Your WAF is selectively blocking AI crawlers. This is the most common scenario.
- Status code 503 with a large HTML response: You're getting served a JavaScript challenge page. The AI crawler sees HTML soup instead of your Markdown file.
- Connection timeout or reset for AI user agents: More aggressive blocking. The connection is being dropped entirely.
Step 2: Read the Response Headers
The response headers often tell you who is blocking the request and how:
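A headers-only request, sketched against a placeholder domain and filtered down to the vendor fingerprints described in this guide:

```shell
URL="https://yoursite.example/llms.txt"

# -I asks for headers only; tr strips carriage returns for clean matching.
headers=$(curl -sI "$URL" | tr -d '\r' || true)
verdict=$(printf '%s\n' "$headers" \
  | grep -iE '^(server|cf-ray|cf-mitigated|x-served-by|x-akamai|x-sucuri)' \
  || echo "no recognizable WAF/CDN headers (or the request failed)")
printf '%s\n' "$verdict"
```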
Look for these headers:
| Header | What It Tells You |
|---|---|
| `server: cloudflare` | Cloudflare is handling the response |
| `cf-ray: ...` | Cloudflare request ID (confirms Cloudflare involvement) |
| `x-served-by: cache-...` | Fastly is in the path |
| `x-akamai-transformed` | Akamai is processing the response |
| `x-sucuri-id` | Sucuri WAF is present |
| `cf-mitigated: challenge` | Cloudflare issued a JS challenge, not your actual content |
Step 3: Check Your WAF Dashboard
If you have access to your WAF provider's dashboard, look for blocked or challenged requests to /llms.txt. In Cloudflare's Security dashboard, look under Security > Events and filter by path. You may find a series of blocked requests from AI crawlers that you never noticed because they didn't generate alerts.
In Cloudflare specifically, also check Security > Bots. This is the AI Audit panel where you can see exactly which AI crawlers have been detected and what action was taken on each.
Step 4: Check robots.txt Consistency
Verify that your robots.txt and WAF policies agree:
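A sketch of the comparison, assuming a placeholder domain and using GPTBot as the example crawler:

```shell
SITE="https://yoursite.example"

# 1. What robots.txt says about GPTBot (no stanza means defaults apply):
curl -s "$SITE/robots.txt" | grep -iA2 'user-agent:[[:space:]]*gptbot' \
  || echo "no GPTBot-specific stanza in robots.txt (or fetch failed)"

# 2. What the WAF actually does when a GPTBot-identified client asks:
code=$(curl -s -o /dev/null -w "%{http_code}" \
  -A "GPTBot/1.0 (+https://openai.com/gptbot)" "$SITE/llms.txt" || true)
echo "WAF verdict for GPTBot on /llms.txt: HTTP $code"
```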
If your robots.txt says Allow: /llms.txt for AI crawlers but your WAF is blocking them, you have a policy conflict: your content policy says yes, your infrastructure says no.
Part 4: Mitigation Strategies for Site Operators
Strategy 1: WAF Allowlist Rules (Recommended)
Most WAFs let you create custom rules that bypass bot detection for specific paths. This is the cleanest solution: you're saying "yes, I know this looks like a bot request; that's the point."
Cloudflare example (WAF Custom Rules):
Create a rule in Security > WAF > Custom Rules:
```
Field:    URI Path
Operator: equals
Value:    /llms.txt
Action:   Skip → All remaining custom rules
```
You can optionally narrow this to known AI user agents:
```
(http.request.uri.path eq "/llms.txt") and
(http.user_agent contains "GPTBot" or
 http.user_agent contains "ClaudeBot" or
 http.user_agent contains "anthropic-ai" or
 http.user_agent contains "Google-Extended" or
 http.user_agent contains "Applebot-Extended")
```
This ensures your llms.txt file is accessible to AI crawlers without relaxing security for the rest of your site.
For other WAFs, the specific syntax differs, but the concept is the same: create a rule that allows requests to /llms.txt (and optionally /llms-full.txt) from AI-associated user agents, and ensure that rule runs before the general bot-detection rules.
Strategy 2: Serve llms.txt from a CDN Edge (No WAF Path)
If your WAF provider supports edge-served static content (Cloudflare Workers, Fastly Compute@Edge, AWS CloudFront Functions), you can serve the llms.txt file directly from the edge without routing through the WAF at all:
```js
// Cloudflare Worker (simplified example).
// LLMS_TXT_CONTENT holds your llms.txt body, inlined here for brevity.
const LLMS_TXT_CONTENT = "# Your Site\n\n> Curated links for AI systems.\n";

addEventListener("fetch", (event) => {
  const url = new URL(event.request.url);
  if (url.pathname === "/llms.txt") {
    // Answer at the edge; the request never reaches the WAF or origin.
    event.respondWith(
      new Response(LLMS_TXT_CONTENT, {
        headers: { "Content-Type": "text/markdown; charset=utf-8" },
      }),
    );
  } else {
    // Everything else passes through to the origin as usual.
    event.respondWith(fetch(event.request));
  }
});
```
This approach means the WAF never sees the request; it's handled at the CDN layer before reaching your origin server. The trade-off is that your llms.txt content lives in a Worker (or equivalent) rather than your deployment, so you need to update it separately.
Strategy 3: DNS-Level Separation
For sites where modifying WAF rules isn't practical (shared hosting, managed platforms, corporate IT policies you can't control), consider hosting your llms.txt on a subdomain that doesn't route through the WAF:
```
https://ai.yoursite.com/llms.txt   ← No WAF
https://yoursite.com/llms.txt      ← Redirects to ai.yoursite.com/llms.txt
```
This is a heavier-handed approach, but it works in environments where you can't modify WAF configuration. The llms.txt spec doesn't require the file to be at the root of the primary domain, and a clear redirect is sufficient for most AI systems.
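Once the redirect is in place, a curl one-liner (placeholder domains) confirms it by printing the status code and redirect target:

```shell
# %{redirect_url} is curl's write-out variable for the Location target.
out=$(curl -s -o /dev/null -w "%{http_code} %{redirect_url}" \
  "https://yoursite.example/llms.txt" || true)
echo "$out"
# a healthy setup prints something like: 301 https://ai.yoursite.example/llms.txt
```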
Strategy 4: Cloudflare AI Audit Dashboard
If you're on Cloudflare, the AI Audit dashboard gives you granular control over which AI crawlers are allowed. Rather than creating custom WAF rules manually, you can:
- Navigate to Security > Bots > AI Scrapers and Crawlers
- Review the list of known AI crawlers
- Toggle specific crawlers between "Blocked" and "Allowed"
This is the least technical option and the one most likely to stay current as new AI crawlers emerge, since Cloudflare maintains the crawler list.
Strategy 5: Verified Bot Programs
Some AI providers offer "verified bot" programs that WAFs can recognize as legitimate:
- OpenAI publishes their IP ranges and GPTBot user agent specification
- Google verifies Googlebot and Google-Extended via DNS reverse lookup
- Anthropic publishes ClaudeBot's expected behavior and IP ranges
Check whether your WAF supports verified bot classification and whether you can allowlist verified AI bots specifically. This gives you the best of both worlds: you block unknown and unverified AI crawlers while allowing the ones you've made a conscious decision to permit.
Part 5: For Developers Building AI Fetching Tools
If you're building a tool that needs to fetch llms.txt files (like, say, a C#/.NET library for parsing, fetching, and validating llms.txt files, purely hypothetical example), here are the realities you'll encounter:
The User Agent Dilemma
Your tool needs to send a user agent string. You have two philosophically irreconcilable options:
- Be honest: send `YourTool/1.0 (https://yourtool.dev)` and accept that a meaningful percentage of your requests will be blocked
- Impersonate a browser: send a Chrome-like user agent string and accept that you're being deceptive about what you are
Option 1 is ethical but doesn't work reliably. Option 2 works reliably but isn't ethical. There is no option 3. Welcome to the llms.txt developer experience.
The responsible approach: be honest by default, implement configurable user agent strings so the human operator can decide what to send, and document the trade-offs clearly. Don't make the ethical decision for your users, but don't pretend there isn't one.
Retry and Degradation Strategies
Your fetching code should handle WAF blocks gracefully:
```
1. Attempt fetch with honest user agent
2. If blocked (403, 503 challenge, timeout):
   a. Log the block with diagnostic details
   b. Check if a cached version exists → use it
   c. If no cache → return a structured error, not an exception
3. Never retry immediately; WAFs escalate responses to rapid retries
4. Implement exponential backoff if retrying at all
```
The worst thing your tool can do is hammer a WAF-protected endpoint with rapid retries. That's how you get your entire IP range blocklisted, and it's indistinguishable from a DDoS attack from the WAF's perspective. Don't be that bot.
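The backoff step can be sketched as a shell loop with curl (the domain and tool identity are placeholders); the same shape translates directly to HttpClient or any other client library:

```shell
URL="https://yoursite.example/llms.txt"
UA="YourTool/1.0 (+https://yourtool.example)"   # hypothetical honest identity

delay=1
for attempt in 1 2 3; do
  code=$(curl -s -o /dev/null -w "%{http_code}" -A "$UA" "$URL" || true)
  if [ "$code" = "200" ]; then
    echo "fetched on attempt $attempt"
    break
  fi
  echo "attempt $attempt blocked or failed (HTTP $code)"
  if [ "$attempt" -lt 3 ]; then
    # never retry immediately; WAFs escalate against rapid retries
    sleep "$delay"
    delay=$((delay * 2))
  fi
done
```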
What 403 Actually Means (It's Complicated)
A 403 Forbidden from a WAF doesn't necessarily mean "this resource doesn't exist" or "you don't have permission." It often means "my bot detection triggered before your request reached the application server." The resource exists, the server would happily serve it, but the security layer said no.
This matters for your error handling. A 403 from a WAF should be categorized differently from an application-level 403. It's a transient infrastructure block, not a permanent access denial. Your tool should communicate this distinction to its users.
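One rough heuristic, assuming the Cloudflare case and a placeholder domain: a 403 whose headers carry a cf-mitigated marker came from the WAF layer, not the application. Other vendors expose different markers, so treat this as a per-provider check, not a general rule:

```shell
URL="https://yoursite.example/llms.txt"

# -D - dumps response headers to stdout; the body is discarded.
hdrs=$(curl -s -D - -o /dev/null -A "GPTBot/1.0" "$URL" | tr -d '\r' || true)
if printf '%s\n' "$hdrs" | grep -qi '^cf-mitigated:'; then
  verdict="WAF-mitigated 403: transient infrastructure block"
else
  verdict="no WAF marker found: treat a 403 as application-level"
fi
echo "$verdict"
```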
Common Misconceptions
"If I allow AI crawlers in robots.txt, the WAF will respect that." No. robots.txt is a request to crawlers; the WAF doesn't read it. These are independent systems with no coordination between them.
"Only sketchy sites use WAFs." Virtually every production website uses some form of WAF or bot protection. If your site is on Cloudflare, Vercel, Netlify, AWS, or any major hosting provider, you probably have one whether you configured it or not.
"AI crawlers will figure out how to bypass WAFs eventually." They might. That creates an arms race that benefits nobody. The better path is explicit allowlisting of the content you want AI systems to access.
"My llms.txt file works fine in my browser, so AI crawlers can read it too." Your browser passes every bot-detection check automatically. It executes JavaScript, maintains cookies, sends a recognized user agent, and originates from a residential IP. An AI crawler does none of these things.
What This Means Going Forward
The WAF–AI interaction isn't a bug that'll be patched away. It's a structural tension between two legitimate concerns:
- Site operators need protection from the genuine threats that WAFs guard against
- AI systems need access to the content that site operators are explicitly choosing to share
Right now, the burden falls on site operators to configure exceptions for content they've already decided to share. That's backwards. It should be as easy to say "let AI systems read this" as it is to publish the content in the first place. Until that changes, the workarounds in this guide are the practical reality.
The llms.txt standard was designed with the best intentions: give AI systems curated content instead of making them parse your entire site. But a standard that's technically published yet practically inaccessible is a standard that isn't working. The infrastructure needs to catch up to the intent.
Troubleshooting
I created a WAF allowlist rule but AI crawlers are still blocked. Check rule ordering. WAF rules execute in priority order, and a higher-priority bot-blocking rule may fire before your allowlist rule. In Cloudflare, custom rules execute top-to-bottom; drag your allowlist rule above any managed rules or super bot fight mode rules.
My curl tests work but AI crawlers still report failures. Your test might be running from a residential IP; AI crawlers run from data center IPs that have different reputation scores. Try testing from a cloud VM (AWS, GCP, Azure) to simulate what the crawler actually experiences.
I'm on shared hosting and can't modify WAF rules. Contact your hosting provider's support and ask if they can create a path-based exception for /llms.txt. Alternatively, consider the DNS-separation strategy (hosting the file on a lightweight static host without WAF overhead). Even a free-tier GitHub Pages site can serve your llms.txt if your primary host won't cooperate.
The WAF is serving a JavaScript challenge instead of blocking outright. This is common with Cloudflare's "I'm Under Attack" mode and managed challenge settings. The challenge page returns a 200 or 503 status with an HTML page containing JavaScript that a browser would auto-execute. To an AI crawler, it's garbage data. To the WAF, it's doing exactly what it was configured to do. The fix is the same: allowlist the path for known AI user agents so the challenge is skipped.
Further Reading
- Web Application Firewall: Glossary entry covering WAF basics
- User Agent: The identity crisis in a single HTTP header
- robots.txt: The original "instructions for machines"
- llms.txt: The standard this whole guide orbits around
- LlmsTxtKit: A library that handles WAF blocks as a first-class concern
- I Write the Docs Before the Code: The blog post where the WAF paradox first surfaced
- Adding llms.txt to a Docusaurus Site: The companion guide for publishing llms.txt files