Web Standards and AI Discovery
These terms cover the protocols, standards, and infrastructure that determine whether AI systems can find, access, and use web content. This is the territory where the llms.txt research lives, and it's messier than you'd expect.
The short version: there are multiple half-overlapping standards, competing proposals, and a Web Application Firewall industry that doesn't particularly care about any of them. Welcome.
| Term | What it is |
|---|---|
| llms.txt | A proposed web standard providing AI systems with a curated Markdown summary of a site |
| llms-full.txt | Optional companion to llms.txt with full Markdown content of all linked pages |
| robots.txt | The original "instructions for machines": the Robots Exclusion Protocol's plain-text file of advisory crawl preferences, in use since 1994 |
| Web Application Firewall (WAF) | The security layer that blocks malicious traffic (and AI crawlers as collateral damage) |
| Generative Engine Optimization (GEO) | Structuring content to be discovered and cited by AI systems (SEO's awkward cousin) |
| Content Signals | Google's proposed standard for AI usage rights and permissions |
| IETF aipref | An IETF working-group effort to standardize AI access/usage preferences, formally drafted but not yet ratified |
| CC Signals | Creative Commons' proposal for AI licensing and copyright preferences |
| User Agent | The identity string in every HTTP request, and the AI-web relationship's identity crisis |
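To make the first two rows concrete, here's a minimal sketch of what an llms.txt file looks like under the proposed format: an H1 title, a blockquote summary, then H2 sections listing Markdown links. The site name, URLs, and descriptions below are illustrative, not from any real deployment.

```markdown
# Example Docs

> A hypothetical documentation site. This one-paragraph summary is what
> an AI system reads first to decide what the site covers.

## Docs

- [Quick start](https://example.com/docs/quickstart.md): Install and run in five minutes
- [API reference](https://example.com/docs/api.md): Endpoints, parameters, and auth

## Optional

- [Changelog](https://example.com/changelog.md): Release history, lower priority for AI consumers
```

The file is served at the site root as `/llms.txt`; the optional `/llms-full.txt` companion inlines the full Markdown content of the linked pages instead of just linking to them.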