robots.txt

A plain text file at a website's root (/robots.txt) that tells web crawlers which parts of the site they're allowed to access. It originated as an informal convention in 1994 (predating most of the web we recognize today), was formalized as RFC 9309 in 2022, and works on the honor system. User-agent directives target specific crawlers; Disallow rules block access to specific paths.
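The two directives combine into per-crawler rule groups. A minimal sketch (the bot name and paths are hypothetical):

```
# Applies to all crawlers not matched by a more specific group
User-agent: *
Disallow: /admin/

# Applies only to a crawler identifying itself as "ExampleBot"
User-agent: ExampleBot
Disallow: /
```

An empty `Disallow:` (or no group at all) means everything is allowed.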

It's essentially a "please knock before entering" sign. Reputable crawlers knock. The rest walk in through the window.
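The "knock" is literal: a well-behaved crawler fetches and parses robots.txt before requesting anything else. Python's standard library ships a parser for exactly this check; a minimal sketch, using hypothetical rules and URLs:

```python
import urllib.robotparser

# Hypothetical robots.txt content for illustration.
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# GPTBot is blocked from the whole site.
print(rp.can_fetch("GPTBot", "https://example.com/articles/"))       # False
# Other crawlers may fetch public paths but not /private/.
print(rp.can_fetch("SomeOtherBot", "https://example.com/articles/")) # True
print(rp.can_fetch("SomeOtherBot", "https://example.com/private/x")) # False
```

In a real crawler you would call `rp.set_url("https://example.com/robots.txt")` and `rp.read()` instead of parsing an inline string. Nothing enforces the result; honoring `can_fetch` is the crawler's choice.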

Why it matters for writers: robots.txt is the original "instructions for machines that visit your site." The emergence of AI-specific crawlers (GPTBot, ClaudeBot, Google-Extended, etc.) has created new questions about what to allow and block. Some site owners block AI crawlers entirely. Others block training crawlers but allow inference-time access, a distinction robots.txt is technically capable of expressing but that very few site owners know how to configure. The llms.txt standard was partly motivated by robots.txt's fundamental limitation: it can only say "yes" or "no," never "here's what you should actually look at."
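The training-versus-inference split comes down to which user-agent names you target. A sketch, assuming the user-agent strings the major providers have published as of this writing (GPTBot and Google-Extended for training, ChatGPT-User for user-initiated fetching), which are subject to change:

```
# Block crawlers that collect training data
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Allow user-initiated, inference-time fetching
User-agent: ChatGPT-User
Allow: /
```

Each provider documents its own crawler names, so a policy like this has to be checked against their current documentation rather than written once and forgotten.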

Related terms: llms.txt · User Agent · Web Application Firewall