robots.txt: The Complete Guide to Controlling Crawler Access
The robots.txt file is one of the oldest and most powerful tools in a webmaster's arsenal. Sitting at the root of every domain, this plain-text file tells crawlers which parts of your site they may access and which they should leave alone. Despite its simplicity, robots.txt is also one of the most frequently misconfigured files on the web. A single misplaced rule can hide an entire site from search engines or expose sections you intended to keep private.
This guide covers everything from foundational syntax to advanced patterns, so you can write robots.txt rules with confidence.
Where robots.txt lives and how crawlers find it
Every compliant crawler, before requesting any other URL on a domain, fetches /robots.txt at the root:
https://example.com/robots.txt
Key rules about the file itself:
- It must be served at the exact path /robots.txt — not /Robots.txt, and not inside a subdirectory.
- It must return a 200 status code with a text/plain content type. If the server returns a 404, crawlers assume everything is allowed. A 5xx error causes most crawlers to pause crawling entirely until the file becomes available.
- The file applies per origin (scheme + host + port). A rule on https://example.com/robots.txt does not govern https://blog.example.com or http://example.com.
- The maximum file size honoured by Google is 500 KiB. Anything beyond that limit is ignored.
Basic syntax
A robots.txt file is composed of one or more rule groups. Each group starts with one or more User-agent lines, followed by Disallow and/or Allow directives:
User-agent: Googlebot
Disallow: /private/
Allow: /private/public-page.html
User-agent: *
Disallow: /tmp/
Disallow: /internal/
User-agent
The User-agent line specifies which crawler the following rules apply to. The wildcard * matches any crawler that does not have its own specific block. Crawlers match themselves against the most specific User-agent block available. If both User-agent: * and User-agent: Googlebot exist, Googlebot uses only the Googlebot-specific block and ignores the wildcard block entirely.
Disallow
Disallow: /path/ tells the matched crawler not to access any URL that starts with /path/. An empty Disallow: (with no path) means nothing is disallowed—equivalent to full access.
Allow
Allow: /path/ explicitly permits access to URLs matching the pattern, even if a broader Disallow would otherwise block them. This is essential for carving out exceptions.
Wildcards and pattern matching
The original 1994 robots.txt specification did not include wildcards, but Google, Bing and most modern crawlers support two important pattern-matching characters:
The asterisk: *
Matches any sequence of characters (including an empty string). Examples:
- Disallow: /*.pdf — blocks all URLs containing .pdf anywhere in the path.
- Disallow: /directory/*/page — blocks URLs like /directory/anything/page.
The dollar sign: $
Anchors the match to the end of the URL. Without $, patterns match as prefixes. Examples:
- Disallow: /*.pdf$ — blocks URLs that end with .pdf but allows /file.pdf?view=1 (because that URL does not end at .pdf).
- Allow: /page$ — allows exactly /page but not /page/subpage or /page?q=1.
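To see how these two characters behave, here is a minimal sketch that translates a robots.txt pattern into a Python regular expression (the pattern_to_regex helper is hypothetical, not part of any robots.txt library):

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt pattern into a compiled regex (sketch).

    * matches any sequence of characters; a trailing $ anchors the
    match to the end of the URL. Otherwise patterns match as prefixes.
    """
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"  # restore the end-of-URL anchor
    return re.compile(regex)

rule = pattern_to_regex("/*.pdf$")
print(bool(rule.match("/file.pdf")))         # True: the URL ends at .pdf
print(bool(rule.match("/file.pdf?view=1")))  # False: a query string follows
```

Because match() anchors at the start of the path, patterns without $ behave as prefixes, exactly as the spec describes.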
Allow vs Disallow: which one wins?
When a URL matches both an Allow and a Disallow rule, the resolution depends on specificity (path length). Google's implementation follows this logic:
- The rule with the longer matching path wins.
- If both rules have the same length, Allow takes precedence.
Example:
User-agent: *
Disallow: /directory/
Allow: /directory/public/
Here, /directory/public/page.html is allowed because the matching Allow path /directory/public/ (18 characters) is longer than the Disallow path /directory/ (11 characters). But /directory/secret.html remains blocked.
This is a common source of confusion. Always test your rules with a robots.txt tester to confirm the outcome for specific URLs.
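The precedence logic can be sketched in a few lines of Python (a simplified model that ignores wildcards; is_allowed is a hypothetical helper, not a library function):

```python
def is_allowed(url_path, rules):
    """Resolve Allow/Disallow precedence, Google-style (sketch).

    rules: list of (directive, pattern) tuples, e.g. ("Disallow", "/directory/").
    The longest matching pattern wins; Allow beats Disallow on a tie.
    """
    best_len, best_allow = -1, True  # no matching rule at all means allowed
    for directive, pattern in rules:
        if pattern and url_path.startswith(pattern):
            if len(pattern) > best_len:
                best_len = len(pattern)
                best_allow = (directive == "Allow")
            elif len(pattern) == best_len and directive == "Allow":
                best_allow = True  # Allow wins ties
    return best_allow

rules = [("Disallow", "/directory/"), ("Allow", "/directory/public/")]
print(is_allowed("/directory/public/page.html", rules))  # True
print(is_allowed("/directory/secret.html", rules))       # False
```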
The Crawl-delay directive
Crawl-delay requests that the crawler wait a specified number of seconds between successive requests:
User-agent: Bingbot
Crawl-delay: 10
Important caveats:
- Google ignores Crawl-delay entirely. Googlebot manages its own crawl rate based on how your server responds (the crawl-rate setting in Google Search Console has been retired).
- Bing respects it. A value of 10 means Bingbot will wait 10 seconds between requests.
- Yandex, Baidu and some other crawlers also honour it, though implementations vary.
- Setting an excessively high value (e.g., 60) effectively stops crawling. Use this sparingly and only when your server genuinely cannot handle the load.
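You can read a declared Crawl-delay programmatically with Python's standard library (a sketch using RobotFileParser.crawl_delay(), available since Python 3.6; the robots.txt content below is illustrative and fed in via parse() so the snippet runs offline):

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: Bingbot
Crawl-delay: 10

User-agent: *
Disallow: /tmp/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.crawl_delay("Bingbot"))        # 10
print(rp.crawl_delay("SomeOtherBot"))   # None: no delay declared for *
```

A polite crawler would sleep for the returned number of seconds between requests when the value is not None.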
The Sitemap directive
You can declare sitemaps directly in robots.txt:
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
Key points:
- The Sitemap directive is not tied to any User-agent block. Place it at the top or bottom of the file — it applies globally.
- The URL must be fully qualified (an absolute URL with scheme).
- You can list multiple sitemaps.
- This is a discovery hint, not a guarantee. Submitting sitemaps through Search Console or Bing Webmaster Tools is more reliable.
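Declared sitemaps can also be read programmatically (a sketch using RobotFileParser.site_maps(), available since Python 3.8; the file content is illustrative):

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /tmp/

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Returns the Sitemap URLs in file order, or None if none are declared.
print(rp.site_maps())
```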
Common mistakes that break crawling
1. Blocking CSS and JavaScript
User-agent: *
Disallow: /assets/
Disallow: /js/
Disallow: /css/
Modern search engines render pages to evaluate content and layout. If you block the resources needed for rendering, the crawler sees a broken page—and may downgrade or skip it. Only block resources that are genuinely irrelevant to rendering public content.
2. Using Disallow to keep pages out of the index
Disallow prevents crawling, not indexing. If other pages link to a disallowed URL, Google can still index it—it just will not have any content to show, resulting in a cryptic listing. To truly remove a page from the index, use a noindex meta tag or X-Robots-Tag header and allow the crawler to see the page.
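For example, to keep a file out of the index while still letting crawlers fetch it, the server can send the header itself (an illustrative nginx-style fragment; the location path is hypothetical):

```nginx
location /internal-report.pdf {
    add_header X-Robots-Tag "noindex";
}
```

Remember: for the noindex to take effect, the URL must not be disallowed in robots.txt, or the crawler will never see the header.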
3. Forgetting the trailing slash
Disallow: /private # blocks /private, /private.html, /privately, etc.
Disallow: /private/ # blocks only paths inside the /private/ directory
The first pattern is broader than most people intend. Always consider whether you need the trailing slash.
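A quick way to see the difference is to model the prefix matching directly (a toy illustration, not a parser):

```python
# robots.txt patterns match as URL prefixes, so /private and /private/
# cover different sets of paths. startswith() mirrors that behaviour.
paths = ["/private", "/private.html", "/privately", "/private/notes.txt"]
for path in paths:
    blocked_broad = path.startswith("/private")    # Disallow: /private
    blocked_narrow = path.startswith("/private/")  # Disallow: /private/
    print(f"{path}: /private blocks={blocked_broad}, /private/ blocks={blocked_narrow}")
```

Every path in the list matches the broad pattern, but only /private/notes.txt matches the narrow one.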
4. Conflicting wildcard and specific blocks
Having a User-agent: * block and a bot-specific block where the specific block is empty effectively gives that bot full access—even if the wildcard block is restrictive. This is by design, but it surprises people who assume rules accumulate.
5. Serving robots.txt behind a redirect
If /robots.txt returns a 301 or 302, most crawlers will follow the redirect. However, chain redirects, redirect loops, or redirecting to a non-text/plain response will cause crawlers to treat the file as unavailable. Keep it simple: serve the file directly at the root with a 200 response.
6. Not accounting for AI crawlers
If your robots.txt only has a User-agent: * block, any AI crawler you have not explicitly blocked will have the same access as Googlebot. Consider adding specific rules for bots like GPTBot, ClaudeBot, PerplexityBot and Bytespider.
Testing your robots.txt
Never deploy robots.txt changes without testing. Available tools include:
- Google Search Console — the robots.txt report shows how Google fetches and parses your file. (The older robots.txt Tester, which checked individual URLs, has been retired.)
- Bing Webmaster Tools — similar testing functionality for Bingbot.
- Spider.es — check which crawlers (search engines, AI bots, SEO tools) can access any URL on your domain, with the specific rule that controls each verdict.
- Command-line tools — libraries like Python's urllib.robotparser let you automate testing in CI/CD pipelines.
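As a sketch of that last option (one caveat: the standard-library parser does plain prefix matching and does not implement Google-style * and $ wildcards, so keep automated checks to wildcard-free rules):

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules; in CI you would read your real robots.txt file.
robots_txt = """\
User-agent: *
Disallow: /admin/
Disallow: /tmp/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Assert the policy you intend before deploying.
assert rp.can_fetch("MyBot", "https://example.com/blog/post")
assert not rp.can_fetch("MyBot", "https://example.com/admin/login")
print("robots.txt policy checks passed")
```

Wiring a script like this into your deployment pipeline catches an accidental Disallow: / before it reaches production.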
A solid starting template
# Search engines: full access
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
Crawl-delay: 5
# AI crawlers: selective
User-agent: GPTBot
Disallow: /premium/
Allow: /blog/
User-agent: Google-Extended
Disallow: /
# Default: allow with restrictions
User-agent: *
Disallow: /admin/
Disallow: /tmp/
Disallow: /search?*
Sitemap: https://example.com/sitemap.xml
Adapt this to your needs. The principle is simple: be explicit about what you allow, deliberate about what you block, and test before you deploy.
Final thoughts
robots.txt is deceptively simple. A few lines of text control whether millions of people can discover your content through search engines and AI tools. Treat it with the same care you give to your site's security configuration. Audit it regularly—especially as new AI crawlers appear—and use tools like Spider.es to verify that your intended policy matches reality.