robots.txt at 30: From the Birth of the Web to the Age of AI

Few pieces of the web have aged as well as a humble plain-text file. robots.txt was born in 1994, when the web had only a few thousand sites, and three decades later it remains the first line of dialogue between your website and the robots that traverse it. Its story is, in many ways, the story of how the internet learned to coexist with the machines that crawl it.

A gentlemen's agreement

The protocol was proposed by engineer Martijn Koster in 1994, after a misconfigured crawler overwhelmed a server. The idea was simple and elegant: a file at the root of the site, /robots.txt, where the owner tells bots which parts they may visit and which they may not. It was never a technical barrier — it was a gentlemen's agreement: well-behaved bots respect it voluntarily.

From convention to official standard

For nearly three decades, robots.txt functioned as a de facto convention that well-behaved crawlers followed but no standards body had ever formalised. That changed in September 2022, when the IETF published RFC 9309, the Robots Exclusion Protocol, driven in large part by Google. At last, the rules the industry had always taken for granted were written down in an official specification.

How it works, in essence

The mechanics have barely changed:

  • User-agent identifies the bot the rule targets.
  • Disallow and Allow mark the paths that are off-limits or permitted.
  • Wildcards are supported (* for any run of characters, $ for the end of the path), and the most specific rule, meaning the longest matching path, wins.
  • Sitemap points to your sitemap.xml.

Simple, human-readable and portable: the same file can be served from any web server and read by any bot that chooses to obey it.
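
To make the "most specific rule wins" idea concrete, here is a minimal Python sketch of how a crawler might resolve Allow and Disallow rules for a single User-agent group. It is an illustration under stated assumptions, not a reference implementation: the function names (pattern_to_regex, is_allowed) and the sample rules are hypothetical, pattern length is counted in characters, and percent-encoding normalisation is left out.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Turn a robots.txt path pattern into a regex anchored at the path start.

    '*' matches any run of characters; a trailing '$' pins the end of the path.
    """
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile("^" + regex + ("$" if anchored else ""))


def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Decide whether `path` may be crawled under one group's rules.

    `rules` is a list of (directive, pattern) pairs, e.g. ("Disallow", "/private/").
    """
    best_len = -1
    allowed = True  # no matching rule means the path may be crawled
    for directive, pattern in rules:
        if not pattern:
            continue  # an empty pattern matches nothing
        if pattern_to_regex(pattern).match(path):
            is_allow = directive.lower() == "allow"
            # The longest matching pattern wins; on a tie, Allow is preferred.
            if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
                best_len = len(pattern)
                allowed = is_allow
    return allowed


# Hypothetical rules for a single User-agent group.
rules = [
    ("Disallow", "/private/"),
    ("Allow", "/private/press/"),
    ("Disallow", "/*.pdf$"),
]

print(is_allowed(rules, "/private/notes.html"))       # False
print(is_allowed(rules, "/private/press/2024.html"))  # True
print(is_allowed(rules, "/docs/report.pdf"))          # False
print(is_allowed(rules, "/docs/report.pdf?x=1"))      # True
```

The second call shows the longer Allow rule overriding the shorter Disallow, and the last one shows why $ matters: once a query string follows .pdf, the pattern no longer matches.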

The challenge of the AI era

The biggest test of its maturity has come from AI crawlers. In recent years, robots.txt has filled up with new names — GPTBot, ClaudeBot, Google-Extended, PerplexityBot — as publishers try to decide who may use their content to train models or generate answers. The 1994 protocol has become, without ever intending to, the battleground for the debate over AI and copyright.
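
As a rough illustration of what such an opt-out can look like, the sketch below feeds a sample robots.txt to Python's standard urllib.robotparser module and asks which crawlers may fetch a page. The policy itself and the example.com URL are assumptions made up for this example, not a recommendation. Note that this parser applies rules in file order and does not expand wildcards, so it is used here only for simple whole-site rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: the named AI crawlers are excluded from the whole site,
# every other bot may crawl everything.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

page = "https://example.com/articles/robots-txt-at-30"  # placeholder URL
bots = ["GPTBot", "ClaudeBot", "Google-Extended", "PerplexityBot", "Googlebot"]

# Map each user-agent token to whether it may fetch the page.
access = {bot: parser.can_fetch(bot, page) for bot in bots}

for bot, allowed in access.items():
    print(f"{bot}: {'allowed' if allowed else 'blocked'}")
```

With this file, the four AI tokens are reported as blocked and Googlebot as allowed. As with everything in robots.txt, that only describes what a compliant bot should do.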

Its limits are still there

It is worth remembering what robots.txt is not. It does not technically block anyone: a malicious bot can ignore it entirely. It does not protect sensitive content — that is what authentication and server-level permissions are for — and it does not guarantee a page stays unindexed if other sites link to it. It is a statement of intent, not a wall.

How Spider helps

Three decades on, the key question remains the same: are your rules doing what you think they are? Spider.es interprets your robots.txt exactly as each bot would and shows you, across more than a hundred crawlers — from Googlebot to the latest AI scrapers — who can access every part of your site. The best way to honour such a long-lived standard is to make sure yours is written correctly.
