Web Standards and AI Discovery
These terms cover the protocols, standards, and infrastructure that determine whether AI systems can find, access, and use web content. This is the territory where the llms.txt research lives, and it's messier than you'd expect.
The short version: there are multiple half-overlapping standards, competing proposals, and a Web Application Firewall industry that doesn't particularly care about any of them. Welcome.
| Term | What it is |
|---|---|
| llms.txt | A proposed web standard providing AI systems with a curated Markdown summary of a site |
| llms-full.txt | Optional companion to llms.txt with full Markdown content of all linked pages |
| robots.txt | The original "instructions for machines": the Robots Exclusion Protocol's plain-text file of advisory crawl preferences, in use since 1994 |
| Web Application Firewall (WAF) | The security layer that blocks malicious traffic (and AI crawlers as collateral damage) |
| Generative Engine Optimization (GEO) | Structuring content to be discovered and cited by AI systems (SEO's awkward cousin) |
| Content Signals | Google's proposed standard for AI usage rights and permissions |
| IETF aipref | An IETF working-group effort to standardize AI access/usage preferences, formally drafted but not yet ratified |
| CC Signals | Creative Commons' proposal for AI licensing and copyright preferences |
| User Agent | The identity string in every HTTP request, and the AI-web relationship's identity crisis |
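To make the first two rows concrete, here's a minimal sketch of what an llms.txt file looks like under the proposed format: an H1 title, a blockquote summary, then H2 sections listing Markdown links. The site name, URLs, and descriptions below are illustrative, not from any real deployment.

```markdown
# Example Docs

> A hypothetical documentation site. This one-paragraph summary is what
> an AI system reads first to decide what the site covers.

## Docs

- [Quick start](https://example.com/docs/quickstart.md): Install and run in five minutes
- [API reference](https://example.com/docs/api.md): Endpoints, parameters, and auth

## Optional

- [Changelog](https://example.com/changelog.md): Release history, lower priority for AI consumers
```

The file is served at the site root as `/llms.txt`; the optional `/llms-full.txt` companion inlines the full Markdown content of the linked pages instead of just linking to them.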