robots.txt: The Complete Guide to Controlling Crawler Access
The robots.txt file is one of the oldest and most powerful tools in a webmaster's arsenal. Sitting at the root of every domain, this plain-text file tells crawlers which parts of your site they may access and which they should leave alone. Despite its simplicity, robots.txt is also one of the most frequently misconfigured files on the web. A single misplaced rule can hide an entire site from search engines or expose sections you intended to keep private.
This guide covers everything from foundational syntax to advanced patterns, so you can write robots.txt rules with confidence.
Where robots.txt lives and how crawlers find it
Every compliant crawler, before requesting any other URL on a domain, fetches /robots.txt at the root:
https://example.com/robots.txt
Key rules about the file itself:
- It must be served at the exact path /robots.txt — not /Robots.txt, and not inside a subdirectory.
- It must return a 200 status code with a text/plain content type. If the server returns a 404, crawlers assume everything is allowed. A 5xx error causes most crawlers to pause crawling entirely until the file becomes available.
- The file applies per origin (scheme + host + port). A rule on https://example.com/robots.txt does not govern https://blog.example.com or http://example.com.
- The maximum file size honoured by Google is 500 KiB. Anything beyond that limit is ignored.
Basic syntax
A robots.txt file is composed of one or more rule groups. Each group starts with one or more User-agent lines, followed by Disallow and/or Allow directives:
User-agent: Googlebot
Disallow: /private/
Allow: /private/public-page.html
User-agent: *
Disallow: /tmp/
Disallow: /internal/
User-agent
The User-agent line specifies which crawler the following rules apply to. The wildcard * matches any crawler that does not have its own specific block. Crawlers match themselves against the most specific User-agent block available. If both User-agent: * and User-agent: Googlebot exist, Googlebot uses only the Googlebot-specific block and ignores the wildcard block entirely.
Disallow
Disallow: /path/ tells the matched crawler not to access any URL that starts with /path/. An empty Disallow: (with no path) means nothing is disallowed—equivalent to full access.
Allow
Allow: /path/ explicitly permits access to URLs matching the pattern, even if a broader Disallow would otherwise block them. This is essential for carving out exceptions.
Wildcards and pattern matching
The original 1994 robots.txt specification did not include wildcards, but Google, Bing and most modern crawlers support two important pattern-matching characters:
The asterisk: *
Matches any sequence of characters (including an empty string). Examples:
- Disallow: /*.pdf — blocks all URLs containing .pdf anywhere in the path.
- Disallow: /directory/*/page — blocks URLs like /directory/anything/page.
The dollar sign: $
Anchors the match to the end of the URL. Without $, patterns match as prefixes. Examples:
- Disallow: /*.pdf$ — blocks URLs that end with .pdf but allows /file.pdf?view=1 (because that URL does not end at .pdf).
- Allow: /page$ — allows exactly /page but not /page/subpage or /page?q=1.
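To see how these two characters behave, here is a minimal sketch that translates a robots.txt pattern into a Python regular expression (the pattern_to_regex helper is hypothetical, not part of any robots.txt library):

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt pattern into a compiled regex (sketch).

    * matches any sequence of characters; a trailing $ anchors the
    match to the end of the URL. Otherwise patterns match as prefixes.
    """
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"  # restore the end-of-URL anchor
    return re.compile(regex)

rule = pattern_to_regex("/*.pdf$")
print(bool(rule.match("/file.pdf")))         # True: the URL ends at .pdf
print(bool(rule.match("/file.pdf?view=1")))  # False: a query string follows
```

Because match() anchors at the start of the path, patterns without $ behave as prefixes, exactly as the spec describes.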
Allow vs Disallow: which one wins?
When a URL matches both an Allow and a Disallow rule, the resolution depends on specificity (path length). Google's implementation follows this logic:
- The rule with the longer matching path wins.
- If both rules have the same length, Allow takes precedence.
Example:
User-agent: *
Disallow: /directory/
Allow: /directory/public/
Here, /directory/public/page.html is allowed because the matching Allow path /directory/public/ (18 characters) is longer than the Disallow path /directory/ (11 characters). But /directory/secret.html remains blocked.
This is a common source of confusion. Always test your rules with a robots.txt tester to confirm the outcome for specific URLs.
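The precedence logic can be sketched in a few lines of Python (a simplified model that ignores wildcards; is_allowed is a hypothetical helper, not a library function):

```python
def is_allowed(url_path, rules):
    """Resolve Allow/Disallow precedence, Google-style (sketch).

    rules: list of (directive, pattern) tuples, e.g. ("Disallow", "/directory/").
    The longest matching pattern wins; Allow beats Disallow on a tie.
    """
    best_len, best_allow = -1, True  # no matching rule at all means allowed
    for directive, pattern in rules:
        if pattern and url_path.startswith(pattern):
            if len(pattern) > best_len:
                best_len = len(pattern)
                best_allow = (directive == "Allow")
            elif len(pattern) == best_len and directive == "Allow":
                best_allow = True  # Allow wins ties
    return best_allow

rules = [("Disallow", "/directory/"), ("Allow", "/directory/public/")]
print(is_allowed("/directory/public/page.html", rules))  # True
print(is_allowed("/directory/secret.html", rules))       # False
```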
The Crawl-delay directive
Crawl-delay requests that the crawler wait a specified number of seconds between successive requests:
User-agent: Bingbot
Crawl-delay: 10
Important caveats:
- Google ignores Crawl-delay entirely. Googlebot manages its own crawl rate based on how your server responds (the crawl-rate setting in Google Search Console has been retired).
- Bing respects it. A value of 10 means Bingbot will wait 10 seconds between requests.
- Yandex, Baidu and some other crawlers also honour it, though implementations vary.
- Setting an excessively high value (e.g., 60) effectively stops crawling. Use this sparingly and only when your server genuinely cannot handle the load.
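You can read a declared Crawl-delay programmatically with Python's standard library (a sketch using RobotFileParser.crawl_delay(), available since Python 3.6; the robots.txt content below is illustrative and fed in via parse() so the snippet runs offline):

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: Bingbot
Crawl-delay: 10

User-agent: *
Disallow: /tmp/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.crawl_delay("Bingbot"))        # 10
print(rp.crawl_delay("SomeOtherBot"))   # None: no delay declared for *
```

A polite crawler would sleep for the returned number of seconds between requests when the value is not None.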
The Sitemap directive
You can declare sitemaps directly in robots.txt:
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
Key points:
- The Sitemap directive is not tied to any User-agent block. Place it at the top or bottom of the file — it applies globally.
- The URL must be fully qualified (an absolute URL with scheme).
- You can list multiple sitemaps.
- This is a discovery hint, not a guarantee. Submitting sitemaps through Search Console or Bing Webmaster Tools is more reliable.
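Declared sitemaps can also be read programmatically (a sketch using RobotFileParser.site_maps(), available since Python 3.8; the file content is illustrative):

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /tmp/

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Returns the Sitemap URLs in file order, or None if none are declared.
print(rp.site_maps())
```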
Common mistakes that break crawling
1. Blocking CSS and JavaScript
User-agent: *
Disallow: /assets/
Disallow: /js/
Disallow: /css/
Modern search engines render pages to evaluate content and layout. If you block the resources needed for rendering, the crawler sees a broken page—and may downgrade or skip it. Only block resources that are genuinely irrelevant to rendering public content.
2. Using Disallow to keep pages out of the index
Disallow prevents crawling, not indexing. If other pages link to a disallowed URL, Google can still index it—it just will not have any content to show, resulting in a cryptic listing. To truly remove a page from the index, use a noindex meta tag or X-Robots-Tag header and allow the crawler to see the page.
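For example, to keep a file out of the index while still letting crawlers fetch it, the server can send the header itself (an illustrative nginx-style fragment; the location path is hypothetical):

```nginx
location /internal-report.pdf {
    add_header X-Robots-Tag "noindex";
}
```

Remember: for the noindex to take effect, the URL must not be disallowed in robots.txt, or the crawler will never see the header.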
3. Forgetting the trailing slash
Disallow: /private # blocks /private, /private.html, /privately, etc.
Disallow: /private/ # blocks only paths inside the /private/ directory
The first pattern is broader than most people intend. Always consider whether you need the trailing slash.
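A quick way to see the difference is to model the prefix matching directly (a toy illustration, not a parser):

```python
# robots.txt patterns match as URL prefixes, so /private and /private/
# cover different sets of paths. startswith() mirrors that behaviour.
paths = ["/private", "/private.html", "/privately", "/private/notes.txt"]
for path in paths:
    blocked_broad = path.startswith("/private")    # Disallow: /private
    blocked_narrow = path.startswith("/private/")  # Disallow: /private/
    print(f"{path}: /private blocks={blocked_broad}, /private/ blocks={blocked_narrow}")
```

Every path in the list matches the broad pattern, but only /private/notes.txt matches the narrow one.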
4. Conflicting wildcard and specific blocks
Having a User-agent: * block and a bot-specific block where the specific block is empty effectively gives that bot full access—even if the wildcard block is restrictive. This is by design, but it surprises people who assume rules accumulate.
5. Serving robots.txt behind a redirect
If /robots.txt returns a 301 or 302, most crawlers will follow the redirect. However, chain redirects, redirect loops, or redirecting to a non-text/plain response will cause crawlers to treat the file as unavailable. Keep it simple: serve the file directly at the root with a 200 response.
6. Not accounting for AI crawlers
If your robots.txt only has a User-agent: * block, any AI crawler you have not explicitly blocked will have the same access as Googlebot. Consider adding specific rules for bots like GPTBot, ClaudeBot, PerplexityBot and Bytespider.
Testing your robots.txt
Never deploy robots.txt changes without testing. Available tools include:
- Google Search Console — the robots.txt report shows how Google fetches and parses your file. (The older robots.txt Tester, which checked individual URLs, has been retired.)
- Bing Webmaster Tools — similar testing functionality for Bingbot.
- Spider.es — check which crawlers (search engines, AI bots, SEO tools) can access any URL on your domain, with the specific rule that controls each verdict.
- Command-line tools — libraries like Python's urllib.robotparser let you automate testing in CI/CD pipelines.
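As a sketch of that last option (one caveat: the standard-library parser does plain prefix matching and does not implement Google-style * and $ wildcards, so keep automated checks to wildcard-free rules):

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules; in CI you would read your real robots.txt file.
robots_txt = """\
User-agent: *
Disallow: /admin/
Disallow: /tmp/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Assert the policy you intend before deploying.
assert rp.can_fetch("MyBot", "https://example.com/blog/post")
assert not rp.can_fetch("MyBot", "https://example.com/admin/login")
print("robots.txt policy checks passed")
```

Wiring a script like this into your deployment pipeline catches an accidental Disallow: / before it reaches production.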
A solid starting template
# Search engines: full access
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
Crawl-delay: 5
# AI crawlers: selective
User-agent: GPTBot
Disallow: /premium/
Allow: /blog/
User-agent: Google-Extended
Disallow: /
# Default: allow with restrictions
User-agent: *
Disallow: /admin/
Disallow: /tmp/
Disallow: /search?*
Sitemap: https://example.com/sitemap.xml
Adapt this to your needs. The principle is simple: be explicit about what you allow, deliberate about what you block, and test before you deploy.
Final thoughts
robots.txt is deceptively simple. A few lines of text control whether millions of people can discover your content through search engines and AI tools. Treat it with the same care you give to your site's security configuration. Audit it regularly—especially as new AI crawlers appear—and use tools like Spider.es to verify that your intended policy matches reality.