The Complete robots.txt Reference: Precedence, Wildcards, AI Bots & Real-World Receipts
- May 22, 2026
- Indexing & Crawl
Most robots.txt files are copied from a template, pasted once, and never understood. That is how a single line silently removes a site from half the web's crawlers, or, just as often, fails to block the thing the owner thought it blocked. This is the complete reference: what every directive does, the precedence rules almost nobody reads, real configurations pulled from live sites in June 2026, and the mistakes that quietly cost traffic.
robots.txt controls crawling, not indexing: a disallowed URL can still rank if it's linked. Crawlers obey only their single most-specific user-agent group, which is why User-agent: * / Disallow: / doesn't block Google when Googlebot has its own group. Within a group, Google picks the most-specific (longest) rule, not the first, but legacy crawlers use first-match, so order still matters if you care about more than Google. Never use robots.txt to hide a page you also want de-indexed: if Google can't crawl it, it can't see your noindex.
- 500 KB: Google's robots.txt size cap; content past it is ignored.
- 1 group: a crawler obeys only its single most-specific user-agent group and ignores all the others.
- ≠ noindex: robots.txt blocks crawling, never indexing; these are different jobs.
📍 What robots.txt is, and the three things it is not
robots.txt is a plain-text file at the root of a host (https://example.com/robots.txt) that tells compliant crawlers which paths they may request. It follows the Robots Exclusion Protocol, standardised as RFC 9309 in 2022. That is all it does. It is not:
- Not a security control. The file is public and advisory. Listing Disallow: /admin/ tells the world where your admin lives while stopping no determined actor. Protect with authentication, not robots.txt.
- Not a way to de-index. Disallowing a URL stops crawling, but Google can still index a disallowed URL it finds via links, showing it with no snippet. To remove a page from search, use a noindex meta tag or header (and leave it crawlable so Google can see it).
- Not a guarantee. Compliance is voluntary. Mainstream crawlers honour it; scrapers and some AI bots do not (covered below).
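To make the noindex route concrete, here is a minimal sketch of how a page could be checked for a noindex signal in either its HTML or its response headers. The function name `has_noindex` and its inputs are hypothetical, and the check is deliberately simplified: it ignores crawler-specific header prefixes such as `googlebot: noindex`.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> style tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {key: (value or "") for key, value in attrs}
        # "robots" targets every crawler; "googlebot" targets Google only.
        if attributes.get("name", "").lower() in ("robots", "googlebot"):
            content = attributes.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",")]


def has_noindex(html: str, headers: dict) -> bool:
    """Return True if the page asks not to be indexed (hypothetical helper)."""
    parser = RobotsMetaParser()
    parser.feed(html)
    # Header names are case-insensitive, so compare them lowercased.
    header_value = next(
        (v for k, v in headers.items() if k.lower() == "x-robots-tag"), ""
    )
    header_directives = [d.strip().lower() for d in header_value.split(",")]
    return "noindex" in parser.directives or "noindex" in header_directives
```

A crawler can only run a check like this on pages it is allowed to fetch, which is why the noindex page must stay crawlable.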
🔧 The anatomy: every directive that matters
A robots.txt file is one or more groups. Each group starts with one or more User-agent lines and contains Allow/Disallow rules. Sitemap is independent of groups.
User-agent: Googlebot # this group applies to Googlebot only
Disallow: /cart/ # block a folder
Allow: /cart/promo.html # …but allow one file inside it
Disallow: /*?sort= # wildcard: any URL containing ?sort=
Disallow: /*.pdf$ # $ anchors the end: only URLs ending .pdf
User-agent: * # everyone else
Disallow: /private/
Sitemap: https://example.com/sitemap.xml # absolute URL, group-independent
- User-agent. Names the crawler the group applies to. * matches any bot that doesn't have its own group.
- Disallow. A path prefix the crawler must not request. An empty Disallow: means "allow everything."
- Allow. Carves an exception out of a broader Disallow. Supported by Google and most majors.
- * and $. * matches any sequence of characters; $ anchors the end of the URL. Google supports both; some legacy crawlers ignore them entirely.
- Sitemap. An absolute URL to your XML sitemap. It is global, not tied to any user-agent group, and you can list several.
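The * and $ semantics described above can be sketched as a small pattern-to-regex translation. This is an illustration, not any crawler's actual implementation; `pattern_to_regex` and `path_matches` are hypothetical names, and empty patterns (an empty Disallow:) are assumed to be filtered out by the caller.

```python
import re


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters, a trailing '$' anchors the end of
    the URL, and everything else is literal and matched as a prefix.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal chunk, then put '.*' wherever a '*' stood.
    body = ".*".join(re.escape(chunk) for chunk in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def path_matches(pattern: str, path: str) -> bool:
    # re.match anchors at the start, which gives prefix semantics.
    return pattern_to_regex(pattern).match(path) is not None


print(path_matches("/cart/", "/cart/checkout"))       # True: plain prefix
print(path_matches("/*?sort=", "/shoes?sort=price"))  # True: wildcard
print(path_matches("/*.pdf$", "/docs/a.pdf?x=1"))     # False: $ anchors the end
```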
⚖️ The precedence rules almost nobody reads
This is where well-meaning files go wrong. Two separate precedence questions decide what a crawler actually does, and they resolve differently across crawlers.
The crawler selects the single most-specific user-agent group matching its name, and ignores every other group entirely.
Within that one group, find the rules whose path matches the URL.
Google: the most-specific (longest) rule wins; a tie goes to Allow.
Legacy/first-match: the first matching rule wins, so order matters.
Rule 1. User-agent groups: most specific wins, and it's winner-take-all
A crawler reads only the one group whose user-agent token is the longest match for its name. It does not merge groups. So a Googlebot group makes Googlebot ignore the * group completely, even the rules you assumed were universal.
Receipt (live, June 2026): LinkedIn, Yelp and Reuters all publish User-agent: * → Disallow: /. It looks like "block the entire internet." It isn't: each of them also defines dedicated Googlebot and Bingbot groups, so those engines crawl freely while everyone else is turned away. The * block never applies to Google because Google matched a more specific group first. LinkedIn even appends a note: apply for whitelisting by email.
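The winner-take-all selection can be sketched as follows. This assumes simple case-insensitive prefix matching between a group's token and the crawler's name, which is a simplification of what real crawlers do; `select_group` and the sample groups are hypothetical.

```python
from typing import Dict, List, Optional


def select_group(groups: Dict[str, List[str]], crawler_name: str) -> Optional[str]:
    """Return the token of the one group this crawler would obey."""
    name = crawler_name.lower()
    # Named groups whose token matches the start of the crawler's name.
    candidates = [
        token for token in groups
        if token != "*" and name.startswith(token.lower())
    ]
    if candidates:
        # Longest token is the most specific; every other group is ignored.
        return max(candidates, key=len)
    # Only crawlers with no named group fall back to '*'.
    return "*" if "*" in groups else None


# Shape of the default-deny allowlist described above (illustrative rules).
groups = {
    "*": ["Disallow: /"],
    "Googlebot": ["Disallow: /private/"],
    "Bingbot": ["Disallow: /private/"],
}
print(select_group(groups, "Googlebot"))    # Googlebot
print(select_group(groups, "SomeScraper"))  # *
```

Because the function returns exactly one token, the `Disallow: /` under `*` never reaches a crawler that has its own group.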
Rule 2. Within a group: Google uses specificity, legacy crawlers use order
Once a crawler is inside its group and a URL matches several rules, who wins? Here implementations diverge, and this is the part that surprises people:
- Googlebot (and most modern majors): the most-specific rule wins, with specificity measured by the number of characters in the path. Order in the file is irrelevant. If an Allow and a Disallow are exactly equally specific, Google breaks the tie toward the least restrictive rule (the Allow). This longest-match behaviour is also what RFC 9309 specifies.
- Legacy crawlers: many older or simpler parsers use first-match-wins, a convention that predates RFC 9309. With those, the order of your Allow/Disallow lines changes the outcome.
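The two interpretations above can be compared side by side in a short sketch. It assumes plain prefix matching (no wildcards) to keep the comparison readable; the function names and the (directive, path) rule format are hypothetical.

```python
from typing import List, Tuple

Rule = Tuple[str, str]  # ("allow" or "disallow", path prefix)


def allowed_longest_match(rules: List[Rule], path: str) -> bool:
    """Longest-match evaluation: the most specific rule wins, ties go to Allow."""
    matches = [
        (len(prefix), kind == "allow")
        for kind, prefix in rules
        if prefix and path.startswith(prefix)
    ]
    if not matches:
        return True  # no rule matched, so the path is crawlable
    # Tuples compare by length first, then True > False breaks ties toward Allow.
    return max(matches)[1]


def allowed_first_match(rules: List[Rule], path: str) -> bool:
    """First-match evaluation: the first matching rule in file order wins."""
    for kind, prefix in rules:
        if prefix and path.startswith(prefix):
            return kind == "allow"
    return True


# Broad Disallow listed first: the two evaluators disagree.
risky = [("disallow", "/downloads/"), ("allow", "/downloads/whitepaper.pdf")]
print(allowed_longest_match(risky, "/downloads/whitepaper.pdf"))  # True
print(allowed_first_match(risky, "/downloads/whitepaper.pdf"))    # False
```

Swapping the two rules so the specific Allow comes first makes both functions return True for the whitepaper, which is the order-proof layout to aim for.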
So the honest answer to "does the order of rules in robots.txt matter?" is: not for Google, but yes for a meaningful slice of other crawlers. Write your file so it produces the correct result under both interpretations: put the specific Allow before the broad Disallow, and don't rely on Google's longest-match cleverness if you care who else obeys you.
# Correct under BOTH longest-match and first-match crawlers:
User-agent: *
Allow: /downloads/whitepaper.pdf # specific exception first
Disallow: /downloads/ # broad block second
🌐 Real configurations in the wild (June 2026)
These are pulled live from production robots.txt files. They are observations, not value judgements; the point is what each one does.
| Site(s) | Configuration | Effect |
|---|---|---|
| LinkedIn, Yelp, Reuters | User-agent: * → Disallow: / + dedicated Googlebot/Bingbot groups | Default-deny allowlist: only named engines crawl; everyone else is blocked |
| NYT | Standalone Disallow: / for OAI-SearchBot, PerplexityBot, Claude-SearchBot | Deliberately absent from ChatGPT, Perplexity and Claude citations |
| CNN, BBC, The Verge, The Guardian, Amazon, eBay | Block one or more AI search crawlers | Reduced or zero eligibility for those engines' answer citations |
| GitHub, Walmart | Crawl-delay: 1 / Crawl-delay: 5 | Honoured by Bing/Yandex; ignored by Google, which has no crawl-delay support |
The AI-bot blocks above are usually deliberate (licensing posture). The trap is doing the same thing by accident; see the AI section below.
🚫 The mistakes that quietly cost you
The most common own-goal. You add noindex to a page and Disallow it in robots.txt. Google can no longer crawl the page, so it never sees the noindex, and the URL can linger in the index. Pick one: to de-index, allow crawling and use noindex; to block crawling, accept the URL may still appear.
Noindex: in robots.txt has not been supported by Google since September 2019. Crawl-delay is ignored by Google. Both are silently skipped: you think you've set a rule, but it does nothing.
If Googlebot can't fetch the CSS and JavaScript needed to render a page, it evaluates a broken version of it. Never disallow the assets your layout and content depend on. (Blocking a single tracking or JSONP endpoint is fine; blocking /assets/ or /wp-includes/ wholesale is not.)
Paths are case-sensitive (/Folder/ ≠ /folder/); directive names are not. robots.txt is scoped to one origin (protocol, host and port), so https:// and http://, and each subdomain, need their own file. And it must sit at the root; a file at /blog/robots.txt is ignored.
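The per-origin scoping can be made concrete with a short sketch that derives which robots.txt governs a given page. `robots_url` is a hypothetical helper built on the standard library's URL parsing.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url.

    Scheme, host and port are kept exactly as given; the path is always
    replaced with /robots.txt and any query or fragment is dropped.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://blog.example.com/posts/1?ref=home"))
# https://blog.example.com/robots.txt
print(robots_url("http://example.com:8080/shop/item"))
# http://example.com:8080/robots.txt
```

Two URLs that differ only in protocol or subdomain map to different files here, which is why each origin needs its own.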
🤖 robots.txt and AI crawlers
The fastest-growing reason to touch robots.txt in 2026 is AI. The critical distinction, covered in depth in The AI Crawler Map, is that AI bots do different jobs: some train models, some index pages to answer questions, some fetch a page live for a user. Blocking a training crawler costs almost no traffic; blocking a search/answer crawler removes you from that engine's citations. The publisher blocks in the table above are deliberate licensing decisions. The trap is replicating them by accident: pasting a broad "block all AI" snippet you found online without checking which bots it covers, so that a training block quietly takes out the search and answer crawlers too and deletes you from ChatGPT Search, Perplexity and Claude.
And remember: robots.txt is a request. Independent monitoring through 2025 reported that some AI operators fetched disallowed pages anyway. If you need a hard block, enforce it at the edge (WAF, verified IP ranges), not in a text file.
🧪 How to test it
- Read the live file. Open yourdomain.com/robots.txt in a browser; that's exactly what crawlers see.
- Inspect a URL in Search Console. The URL Inspection tool reports whether a specific URL is blocked by robots.txt for Googlebot.
- Use a validator. Google's open-source robots.txt parser (the basis for its real behaviour) and several third-party testers will tell you whether a given URL/user-agent pair is allowed.
- Confirm assets render. Use the URL Inspection "view rendered page" to ensure CSS/JS aren't blocked.
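For a quick local check of a URL/user-agent pair, Python's standard library ships a parser, `urllib.robotparser`. The sketch below feeds it a file as text rather than fetching one; `is_allowed`, `ROBOTS_TXT` and the `ExampleBot` name are illustrative. Note that this parser does plain prefix matching and evaluates rules in file order, so it behaves like a legacy first-match crawler rather than like Googlebot.

```python
from urllib.robotparser import RobotFileParser

# Order-proof layout: the specific Allow precedes the broad Disallow.
ROBOTS_TXT = """\
User-agent: *
Allow: /downloads/whitepaper.pdf
Disallow: /downloads/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check one URL for one crawler against robots.txt content."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for path in ("/downloads/whitepaper.pdf", "/downloads/other.zip", "/about"):
    url = "https://example.com" + path
    print(path, is_allowed(ROBOTS_TXT, "ExampleBot", url))
```

Because the result reflects a first-match reading, a file that passes here and also passes a longest-match validator is correct under both interpretations.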
❓ robots.txt FAQ
Does robots.txt stop a page from being indexed?
No. It stops crawling. A disallowed URL can still be indexed (without a snippet) if other pages link to it. Use noindex to keep a page out of search.
Does the order of rules in robots.txt matter?
Not for Google, which uses the most-specific rule regardless of order. Yes for first-match crawlers. Write your file to be correct under both.
Is robots.txt case-sensitive?
The paths are (/Page ≠ /page); the directive keywords (User-agent, Disallow) are not.
Can I allow a single file inside a disallowed folder?
Yes, add an Allow for the specific file. It's more specific than the folder Disallow, so it wins for Google; place it first for legacy crawlers too.
Does Google honour Crawl-delay?
No. Google ignores it, and Search Console's crawl-rate limiter was retired in early 2024, so Googlebot sets its own rate from how your server responds. Bing and Yandex do honour it.
How large can a robots.txt file be?
Google reads up to 500 KB; anything beyond is ignored. Keep it lean.
Does one robots.txt cover all my subdomains?
No. It's per-origin. blog.example.com and shop.example.com each need their own, as do http and https.
What happens if there is no robots.txt?
Crawlers assume everything is allowed. A missing file (returning 404) is treated as "crawl freely", which is often fine.
Claude Vincent is a technical SEO consultant focused on crawlability, rendering, and AI-search visibility. He writes the field guides and case studies at SEO ProCheck, with a bias toward the durable, unglamorous work that decides whether search engines and AI answer engines can actually read and cite a site.