The Complete robots.txt Reference: Precedence, Wildcards, AI Bots & Real-World Receipts

Most robots.txt files are copied from a template, pasted once, and never understood. That is how a single line silently removes a site from half the web's crawlers, or, just as often, fails to block the thing the owner thought it blocked. This is the complete reference: what every directive does, the precedence rules almost nobody reads, real configurations pulled from live sites in June 2026, and the mistakes that quietly cost traffic.

🗺️ TL;DR

robots.txt controls crawling, not indexing: a disallowed URL can still rank if it's linked. Crawlers obey only their single most-specific user-agent group, which is why User-agent: * / Disallow: / doesn't block Google when Googlebot has its own group. Within a group, Google picks the most-specific (longest) rule, not the first, but legacy crawlers use first-match, so order still matters if you care about more than Google. Never use robots.txt to hide a page you also want de-indexed: if Google can't crawl it, it can't see your noindex.

  • 500 KiB: Google's robots.txt size cap; content past it is ignored.
  • 1 group: a crawler obeys only its single most-specific user-agent group, and ignores all the others.
  • ≠ noindex: robots.txt blocks crawling, never indexing; these are different jobs.

📍 What robots.txt is, and the three things it is not

robots.txt is a plain-text file at the root of a host (https://example.com/robots.txt) that tells compliant crawlers which paths they may request. It follows the Robots Exclusion Protocol, standardised as RFC 9309 in 2022. That is all it does. It is not:

  • Not a security control. The file is public and advisory. Listing Disallow: /admin/ tells the world where your admin lives while stopping no determined actor. Protect with authentication, not robots.txt.
  • Not a way to de-index. Disallowing a URL stops crawling, but Google can still index a disallowed URL it finds via links, showing it with no snippet. To remove a page from search, use a noindex meta tag or header (and leave it crawlable so Google can see it).
  • Not a guarantee. Compliance is voluntary. Mainstream crawlers honour it; scrapers and some AI bots do not (covered below).

🔧 The anatomy: every directive that matters

A robots.txt file is one or more groups. Each group starts with one or more User-agent lines and contains Allow/Disallow rules. Sitemap is independent of groups.

User-agent: Googlebot # this group applies to Googlebot only
Disallow: /cart/ # block a folder
Allow: /cart/promo.html # …but allow one file inside it
Disallow: /*?sort= # wildcard: any URL containing ?sort=
Disallow: /*.pdf$ # $ anchors the end: only URLs ending .pdf

User-agent: * # everyone else
Disallow: /private/

Sitemap: https://example.com/sitemap.xml # absolute URL, group-independent

  • User-agent: names the crawler the group applies to. * matches any bot that doesn't have its own group.
  • Disallow: a path prefix the crawler must not request. Empty Disallow: means "allow everything."
  • Allow: carves an exception out of a broader Disallow. Supported by Google and most majors.
  • * and $: * matches any sequence of characters; $ anchors the end of the URL. Google supports both; some legacy crawlers ignore them entirely.
  • Sitemap: an absolute URL to your XML sitemap. It is global, not tied to any user-agent group, and you can list several.
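
To make the * and $ semantics concrete, here is a minimal sketch that translates a robots.txt path pattern into a regular expression. The helper names `pattern_to_regex` and `path_matches` are hypothetical, and the sketch ignores percent-encoding normalisation, so treat it as an illustration rather than a reference parser.

```python
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex anchored at the start."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape every literal piece, then join the pieces with '.*' for each '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    # Without a trailing '$' the pattern behaves as a prefix match.
    return re.compile("^" + body + (r"\Z" if anchored else ""))

def path_matches(pattern: str, path: str) -> bool:
    return pattern_to_regex(pattern).match(path) is not None

print(path_matches("/cart/", "/cart/checkout"))        # True: plain prefix
print(path_matches("/*?sort=", "/shoes?sort=price"))   # True: wildcard
print(path_matches("/*.pdf$", "/docs/guide.pdf"))      # True: ends in .pdf
print(path_matches("/*.pdf$", "/docs/guide.pdf?v=2"))  # False: $ anchors the end
```

A crawler that ignores wildcards would instead treat `/*?sort=` as a literal prefix, which matches almost nothing.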

⚖️ The precedence rules almost nobody reads

This is where well-meaning files go wrong. Two separate precedence questions decide what a crawler actually does, and they resolve differently across crawlers.

How a crawler decides what it may fetch

  1. Pick ONE group. The crawler selects the single most-specific user-agent group matching its name, and ignores every other group entirely.
  2. Match the path. Within that one group, find the rules whose path matches the URL.
  3. Resolve the winner. Google: the most-specific (longest) rule wins; a tie goes to Allow. Legacy/first-match: the first matching rule wins, so order matters.

Rule 1, User-agent groups: most specific wins, and it's winner-take-all

A crawler reads only the one group whose user-agent token is the longest match for its name. It does not merge groups. So a Googlebot group makes Googlebot ignore the * group completely, even the rules you assumed were universal.

Receipt (live, June 2026): LinkedIn, Yelp and Reuters all publish User-agent: * → Disallow: /. It looks like "block the entire internet." It isn't: each of them also defines dedicated Googlebot and Bingbot groups, so those engines crawl freely while everyone else is turned away. The * block never applies to Google because Google matched a more specific group first. LinkedIn even appends a note: apply for whitelisting by email.
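
The winner-take-all behaviour can be sketched in a few lines. This is a simplified model, not any crawler's actual code: `parse_groups` and `select_group` are hypothetical names, a token is treated as matching when the crawler name starts with it, and groups that share the same winning token have their rules combined.

```python
def parse_groups(text):
    """Split robots.txt text into (agents, rules) groups."""
    groups, agents, rules = [], [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            if rules:  # a User-agent line after rules starts a new group
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
        elif key in ("allow", "disallow") and agents:
            rules.append((key, value))
    if agents:
        groups.append((agents, rules))
    return groups

def select_group(groups, crawler_name):
    """Return the rules of the most specific matching group only."""
    name = crawler_name.lower()
    best_len, best_rules = -1, []
    for agents, rules in groups:
        for token in agents:
            if token == "*":
                length = 0  # the wildcard group is the least specific
            elif name.startswith(token):
                length = len(token)
            else:
                continue
            if length > best_len:
                best_len, best_rules = length, list(rules)
            elif length == best_len:
                best_rules.extend(rules)
    return best_rules

ROBOTS = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /private/
"""

groups = parse_groups(ROBOTS)
print(select_group(groups, "Googlebot"))    # [('disallow', '/private/')]
print(select_group(groups, "SomeScraper"))  # [('disallow', '/')]
```

Note that the Googlebot result contains nothing from the * group: the blanket Disallow: / never reaches it.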

Rule 2, Within a group: Google uses specificity, legacy crawlers use order

Once a crawler is inside its group and a URL matches several rules, who wins? Here implementations diverge, and this is the part that surprises people:

  • Googlebot (and most modern majors): the most-specific rule wins, specificity measured by the number of characters in the path. This is also the behaviour RFC 9309 specifies. Order in the file is irrelevant. If an Allow and a Disallow are exactly equally specific, Google breaks the tie toward the least restrictive rule (the Allow).
  • Legacy crawlers: many older or simpler parsers, written before RFC 9309, use first-match-wins. With those, the order of your Allow/Disallow lines changes the outcome.

So the honest answer to "does the order of rules in robots.txt matter?" is: not for Google, but yes for a meaningful slice of other crawlers. Write your file so it produces the correct result under both interpretations: put the specific Allow before the broad Disallow, and don't rely on Google's longest-match cleverness if you care who else obeys you.

# Correct under BOTH longest-match and first-match crawlers:
User-agent: *
Allow: /downloads/whitepaper.pdf # specific exception first
Disallow: /downloads/ # broad block second
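
A minimal sketch of the two resolution strategies shows why the ordering above is the safe one. It assumes plain prefix rules with no wildcards, and the function names are hypothetical.

```python
def longest_match_allowed(rules, path):
    """Most-specific rule wins; ties go to Allow; no match means allowed."""
    best = None  # (prefix length, is_allow)
    for directive, prefix in rules:
        if prefix and path.startswith(prefix):
            candidate = (len(prefix), directive == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]

def first_match_allowed(rules, path):
    """The first rule that matches decides; no match means allowed."""
    for directive, prefix in rules:
        if prefix and path.startswith(prefix):
            return directive == "allow"
    return True

SAFE_ORDER = [
    ("allow", "/downloads/whitepaper.pdf"),
    ("disallow", "/downloads/"),
]
RISKY_ORDER = list(reversed(SAFE_ORDER))
PATH = "/downloads/whitepaper.pdf"

print(longest_match_allowed(SAFE_ORDER, PATH))   # True
print(first_match_allowed(SAFE_ORDER, PATH))     # True
print(longest_match_allowed(RISKY_ORDER, PATH))  # True
print(first_match_allowed(RISKY_ORDER, PATH))    # False: the broad Disallow hits first
```

Only the first-match crawler is sensitive to the swap, which is exactly the case the safe ordering protects against.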

🌐 Real configurations in the wild (June 2026)

Pulled live from production robots.txt files, these are observations, not value judgements. The point is what each does.

Site(s) | Configuration | Effect
--- | --- | ---
LinkedIn, Yelp, Reuters | User-agent: * → Disallow: / + dedicated Googlebot/Bingbot groups | Default-deny allowlist: only named engines crawl; everyone else is blocked
NYT | Standalone Disallow: / for OAI-SearchBot, PerplexityBot, Claude-SearchBot | Deliberately absent from ChatGPT, Perplexity and Claude citations
CNN, BBC, The Verge, The Guardian, Amazon, eBay | Block one or more AI search crawlers | Reduced or zero eligibility for those engines' answer citations
GitHub, Walmart | Crawl-delay: 1 / Crawl-delay: 5 | Honoured by Bing/Yandex; ignored by Google, which has no crawl-delay support

The AI-bot blocks above are usually deliberate (licensing posture). The trap is doing the same thing by accident; see the AI section below.

🚫 The mistakes that quietly cost you

Blocking a page you also want de-indexed

The most common own-goal. You add noindex to a page and Disallow it in robots.txt. Google can no longer crawl the page, so it never sees the noindex, and the URL can linger in the index. Pick one: to de-index, allow crawling and use noindex; to block crawling, accept the URL may still appear.
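
The conflict can be expressed as a small check. This sketch assumes you already know whether the URL is blocked by robots.txt and have the page's HTML and response headers in hand; `has_noindex` and `diagnose` are hypothetical helpers, and the regex only handles a meta tag whose name attribute comes before content.

```python
import re

# Simplified: matches <meta name="robots" content="..."> in that attribute order.
META_ROBOTS = re.compile(
    r'<meta[^>]+name=["\']robots["\'][^>]*content=["\']([^"\']*)["\']',
    re.IGNORECASE,
)

def has_noindex(html, headers):
    """True if a noindex appears in the X-Robots-Tag header or the robots meta tag."""
    header_value = headers.get("X-Robots-Tag", "")
    meta = META_ROBOTS.search(html)
    meta_value = meta.group(1) if meta else ""
    return "noindex" in (header_value + "," + meta_value).lower()

def diagnose(blocked_by_robots, html, headers):
    noindex = has_noindex(html, headers)
    if blocked_by_robots and noindex:
        return "conflict: the page cannot be crawled, so the noindex is never seen"
    if noindex:
        return "ok: crawlable with noindex, so the page can drop out of the index"
    if blocked_by_robots:
        return "blocked: not crawled, but the URL may still be indexed via links"
    return "ok: crawlable and indexable"

page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(diagnose(True, page, {}))
print(diagnose(False, page, {}))
print(diagnose(True, "<html></html>", {}))
```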

Using unsupported directives

Noindex: in robots.txt has not been supported by Google since September 2019. Crawl-delay is ignored by Google. Both are silently skipped: you think you've set a rule, but it does nothing.
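
A quick lint pass can surface these lines before they mislead anyone. The sketch below is an illustration only: `find_google_ignored` is a hypothetical helper, and the set of directives it flags covers just the two named above.

```python
GOOGLE_IGNORED = {"noindex", "crawl-delay"}

def find_google_ignored(robots_text):
    """Return (line_number, directive) for lines Google silently skips."""
    findings = []
    for number, raw in enumerate(robots_text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        key = line.split(":", 1)[0].strip().lower()
        if key in GOOGLE_IGNORED:
            findings.append((number, key))
    return findings

SAMPLE = """\
User-agent: *
Crawl-delay: 5
Noindex: /drafts/
Disallow: /private/
"""

print(find_google_ignored(SAMPLE))  # [(2, 'crawl-delay'), (3, 'noindex')]
```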

Blocking render-critical CSS/JS

If Googlebot can't fetch the CSS and JavaScript needed to render a page, it evaluates a broken version of it. Never disallow the assets your layout and content depend on. (Blocking a single tracking or JSONP endpoint is fine; blocking /assets/ or /wp-includes/ wholesale is not.)

Path and scope slips

Paths are case-sensitive (/Folder/ ≠ /folder/); directive names are not. robots.txt is scoped to one origin (protocol, host and port), so https:// and http://, and each subdomain, need their own file. And it must sit at the root; a file at /blog/robots.txt is ignored.
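
The per-origin scope is easy to see when you compute which robots.txt governs a given page. A minimal sketch using the standard library (the function name `robots_url` is hypothetical):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    """Return the robots.txt URL that governs page_url (same scheme, host, port)."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://example.com/blog/post?id=7"))   # https://example.com/robots.txt
print(robots_url("https://blog.example.com/post"))        # https://blog.example.com/robots.txt
print(robots_url("http://example.com:8080/shop/cart"))    # http://example.com:8080/robots.txt
```

Three different origins yield three different files, and none of them sits anywhere but the root.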

🤖 robots.txt and AI crawlers

The fastest-growing reason to touch robots.txt in 2026 is AI. The critical distinction, covered in depth in The AI Crawler Map, is that AI bots do different jobs: some train models, some index pages to answer questions, some fetch a page live for a user. Blocking a training crawler costs almost no traffic; blocking a search/answer crawler removes you from that engine's citations. The publisher blocks in the table above are deliberate licensing decisions. The danger is replicating them by accident: pasting a broad "block all AI" snippet you found online without checking which bots it covers, so that a block meant for training crawlers quietly takes out the search and answer crawlers too and deletes you from ChatGPT Search, Perplexity and Claude.
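
One way to catch an accidental block is to ask, per bot, whether a given robots.txt turns it away. The sketch below uses Python's standard `urllib.robotparser`. The role labels in `AI_BOTS` are assumptions for illustration (GPTBot and ClaudeBot are not named in this article), so confirm each token against the operator's current documentation; `blocked_bots` and the two sample files are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed role labels, for illustration only.
AI_BOTS = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "OAI-SearchBot": "search",
    "PerplexityBot": "search",
    "Claude-SearchBot": "search",
}

def blocked_bots(robots_text, role, url="https://example.com/article"):
    """List the bots with the given role that may not fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return [bot for bot, bot_role in AI_BOTS.items()
            if bot_role == role and not parser.can_fetch(bot, url)]

# A pasted "block all AI" snippet: one group covering every token.
BLANKET = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: OAI-SearchBot
User-agent: PerplexityBot
User-agent: Claude-SearchBot
Disallow: /
"""

# A narrower file that only turns away the training crawlers.
TRAINING_ONLY = """\
User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /
"""

print(blocked_bots(BLANKET, "search"))          # all three search bots
print(blocked_bots(TRAINING_ONLY, "search"))    # []
print(blocked_bots(TRAINING_ONLY, "training"))  # both training bots
```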

And remember: robots.txt is a request. Independent monitoring through 2025 reported that some AI operators fetched disallowed pages anyway. If you need a hard block, enforce it at the edge (WAF, verified IP ranges), not in a text file.

🧪 How to test it

  1. Read the live file. Open yourdomain.com/robots.txt in a browser; that's exactly what crawlers see.
  2. Inspect a URL in Search Console. The URL Inspection tool reports whether a specific URL is blocked by robots.txt for Googlebot.
  3. Use a validator. Google's open-source robots.txt parser (the basis for its real behaviour) and several third-party testers will tell you whether a given URL/user-agent pair is allowed.
  4. Confirm assets render. Use the URL Inspection "view rendered page" to ensure CSS/JS aren't blocked.
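
For a scripted check, a small expectation table can be run against the file with Python's standard `urllib.robotparser`. Be aware that this parser resolves rules by first match and does not interpret * or $ in paths, so it approximates the legacy interpretation rather than Googlebot's. The bot name `ExampleBot`, the sample file, and `failed_expectations` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

ROBOTS = """\
User-agent: *
Allow: /downloads/whitepaper.pdf
Disallow: /downloads/
Disallow: /private/
"""

# (user-agent, URL, should the fetch be allowed?)
EXPECTATIONS = [
    ("ExampleBot", "https://example.com/", True),
    ("ExampleBot", "https://example.com/downloads/whitepaper.pdf", True),
    ("ExampleBot", "https://example.com/downloads/other.zip", False),
    ("ExampleBot", "https://example.com/private/report", False),
]

def failed_expectations(robots_text, expectations):
    """Return the expectations the parsed file does not satisfy."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return [(agent, url, expected) for agent, url, expected in expectations
            if parser.can_fetch(agent, url) != expected]

print(failed_expectations(ROBOTS, EXPECTATIONS))  # []
```

An empty list means every expectation holds; rerun the table whenever the file changes.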

❓ robots.txt FAQ

Does robots.txt stop a page from being indexed?

No. It stops crawling. A disallowed URL can still be indexed (without a snippet) if other pages link to it. Use noindex to keep a page out of search.

Does the order of rules matter?

Not for Google, which uses the most-specific rule regardless of order. Yes for first-match crawlers. Write your file to be correct under both.

Is robots.txt case-sensitive?

The paths are (/Page ≠ /page); the directive keywords (User-agent, Disallow) are not.

Can I block a folder but allow one file inside it?

Yes, add an Allow for the specific file. It's more specific than the folder Disallow, so it wins for Google; place it first for legacy crawlers too.

Does Google support crawl-delay?

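No. Google ignores it, and the Search Console crawl-rate limiter was retired in early 2024. To slow Googlebot temporarily, Google's documentation suggests returning 500, 503 or 429 responses. Bing and Yandex do honour it.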

How large can robots.txt be?

Google reads up to 500 KiB; anything beyond is ignored. Keep it lean.
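
As a rough illustration of the cap, the sketch below splits a robots.txt body at the limit, taking it as 500 KiB (the unit Google's documentation uses). The helper name `split_at_google_limit` is hypothetical, and this only models "content past the limit is ignored", not how a rule straddling the boundary is handled.

```python
GOOGLE_LIMIT_BYTES = 500 * 1024  # 500 KiB

def split_at_google_limit(robots_bytes: bytes):
    """Return the (read, ignored) portions of a robots.txt body."""
    return robots_bytes[:GOOGLE_LIMIT_BYTES], robots_bytes[GOOGLE_LIMIT_BYTES:]

body = b"User-agent: *\nDisallow: /private/\n"
read, ignored = split_at_google_limit(body)
print(len(read), len(ignored))  # a small file leaves nothing in the ignored part
```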

Does one robots.txt cover my subdomains?

No. It's per-origin. blog.example.com and shop.example.com each need their own, as do http and https.

What happens if I have no robots.txt at all?

Crawlers assume everything is allowed. A missing file (returning 404) is treated as "crawl freely", which is often fine.


Claude Vincent is a technical SEO consultant focused on crawlability, rendering, and AI-search visibility. He writes the field guides and case studies at SEO ProCheck, with a bias toward the durable, unglamorous work that decides whether search engines and AI answer engines can actually read and cite a site.
