Open your server logs for a single day and you will find a parade of robots you never invited: GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot, Bytespider, and a dozen more. Each one wants something different, each one reacts differently when you block it, and most site owners treat all of them as a single switch labelled "AI." That is exactly how good pages quietly vanish from AI answers while their owners congratulate themselves on "blocking the scrapers."
This is the map I wish every client had before they touched their robots.txt. It covers every major AI user-agent crawling the open web in 2026, what each one actually does, and the specific thing that breaks when you say no.
There is no single "AI bot." There are roughly twenty named user-agents doing three different jobs: training models, indexing pages to answer questions, and fetching a page live for one user. Blocking a training crawler costs you almost no traffic. Blocking a search/answer crawler removes you from AI citations entirely. Knowing which is which is the whole game.
- ~20 named AI user-agents crawl the open web today.
- 3 jobs: every bot does exactly one of them (train, index-to-answer, or fetch-for-a-user).
- 1 line in robots.txt can erase you from AI answers while search stays untouched.
💡 The only mental model you need: three jobs
Every AI crawler on the planet is doing one of three jobs. Sort each bot into the right bucket and the entire "should I block this?" question answers itself.
- Training crawlers harvest text to teach a model. They visit once in a while, take a copy, and leave. Blocking them protects your content from future model training but costs you essentially zero live visibility; the model was never going to link back to you anyway.
- Search & answer crawlers build the live index an assistant searches when a user asks a question. These are the bots that cite you. Block one and you remove yourself from that engine's answers, which is the AI equivalent of deleting yourself from Google's index.
- User-fetch agents retrieve a single page in real time because a specific human asked the assistant to read that URL. Block them and "summarise this link" requests to your pages simply fail.
The expensive mistake is treating all three as the same threat. People who paste a "block all AI" snippet to stop model training also quietly delete themselves from ChatGPT Search, Claude, and Perplexity: the surfaces actually sending qualified readers in 2026.
🤖 The map: every major AI bot in 2026
Grouped by operator. "Honors robots.txt" reflects each operator's published policy; the asterisks matter, and I unpack them in the limitations section.
| User-agent | Operator | Job | robots.txt | What blocking costs you |
|---|---|---|---|---|
| GPTBot | OpenAI | Train | Yes | Your text won't train GPT models |
| OAI-SearchBot | OpenAI | Index → answer | Yes | You disappear from ChatGPT Search citations |
| ChatGPT-User | OpenAI | User fetch | Yes | "Open this link" requests fail |
| ClaudeBot | Anthropic | Train | Yes | Your text won't train Claude |
| Claude-SearchBot | Anthropic | Index → answer | Yes | You're absent from Claude's web citations |
| Claude-User | Anthropic | User fetch | Yes | User-initiated fetches fail |
| PerplexityBot | Perplexity | Index → answer | Yes* | You vanish from Perplexity citations |
| Perplexity-User | Perplexity | User fetch | No* | Usually still fetches (user-initiated) |
| Googlebot | Google | Index → Search + AI Overviews | Yes | You leave Google Search entirely; AI Overviews can't be blocked alone |
| Google-Extended | Google | Token: Gemini training | Token | Opt out of Gemini training/grounding; Search unaffected |
| Bingbot | Microsoft | Index → Search + Copilot | Yes | You leave Bing and Copilot answers |
| Applebot | Apple | Index → Siri/Spotlight | Yes | Absent from Siri and Spotlight results |
| Applebot-Extended | Apple | Token: Apple Intelligence training | Token | Opt out of Apple AI training; Search unaffected |
| Amazonbot | Amazon | Index → Alexa answers | Yes | Absent from Alexa answers |
| Meta-ExternalAgent | Meta | Train | Yes | Your text won't train Meta AI |
| CCBot | Common Crawl | Train (open dataset) | Yes | Less inclusion in a dataset many labs train on |
| Bytespider | ByteDance | Train | No* | May keep crawling regardless |
| DuckAssistBot | DuckDuckGo | Index → DuckAssist | Yes | Absent from DuckAssist answers |
A "Token" entry is a control token you name in a robots.txt User-agent line, not a separate crawler that shows up in your logs. Google-Extended and Applebot-Extended govern training only and never affect how you rank in their search products.
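To apply the three-job sort to raw User-Agent headers, a lookup along the following lines is one option. This is a minimal Python sketch rather than an official list: the names `AI_BOT_JOBS` and `classify` are hypothetical, the job labels simply restate the table above, matching is a plain case-insensitive substring check, and the sample header strings are illustrative. The two -Extended tokens are left out because they are robots.txt tokens, not crawlers that appear in logs.

```python
from typing import Optional

# Job buckets copied from the table above. Googlebot, Bingbot, Applebot and
# Amazonbot also feed classic search, so "index" is a simplification for them.
AI_BOT_JOBS = {
    "GPTBot": "train",
    "OAI-SearchBot": "index",
    "ChatGPT-User": "user-fetch",
    "ClaudeBot": "train",
    "Claude-SearchBot": "index",
    "Claude-User": "user-fetch",
    "PerplexityBot": "index",
    "Perplexity-User": "user-fetch",
    "Googlebot": "index",
    "Bingbot": "index",
    "Applebot": "index",
    "Amazonbot": "index",
    "Meta-ExternalAgent": "train",
    "CCBot": "train",
    "Bytespider": "train",
    "DuckAssistBot": "index",
}


def classify(user_agent: str) -> Optional[str]:
    """Return 'train', 'index', or 'user-fetch' for a known bot, else None."""
    ua = user_agent.lower()
    for token, job in AI_BOT_JOBS.items():
        if token.lower() in ua:
            return job
    return None


if __name__ == "__main__":
    # Illustrative header values, not the operators' exact strings.
    print(classify("Mozilla/5.0 (compatible; GPTBot/1.0)"))
    print(classify("Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"))
    print(classify("Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0"))
```

Because the header can be spoofed, a result from a lookup like this is a label for triage, not proof of who sent the request.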
🚦 What actually breaks when you block each type
Here is the part the "block all AI" crowd never thinks through. The three jobs carry wildly different costs.
Block training crawlers and you keep your work out of future model training. You lose almost no live traffic, because training crawlers don't send readers; they don't even cite. This is the safe, defensible block if your concern is "don't train on my IP without asking." Pair it with the -Extended tokens to cover Gemini and Apple Intelligence.
Block search and answer crawlers and you get self-sabotage dressed as caution. These bots build the index the assistant searches at answer time. Disallow them and you are simply not a candidate to be cited; your competitor who left them allowed gets the mention and the click. If you want to appear in AI answers, these must stay open.
Block user-fetch agents and you break requests that fire when a real person pastes your URL and asks the assistant to read it. That person gets "I couldn't access that page." It rarely moves rankings, but it is a bad look at the exact moment someone is trying to engage with your content directly.
🔧 robots.txt recipes you can paste today
Multiple User-agent lines can share one block of rules. That is valid syntax, not a hack.
Recipe 1: Stay in AI answers, opt out of model training. The configuration most publishers actually want:
# Let the answer engines cite you
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
Allow: /
# Opt out of model training
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Meta-ExternalAgent
User-agent: Google-Extended
User-agent: Applebot-Extended
Disallow: /
Recipe 2: Block everything AI, accept the cost. Legitimate for paywalled or licensing-sensitive sites, as long as you know you're leaving every AI answer surface:
User-agent: GPTBot
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: Claude-SearchBot
User-agent: Claude-User
User-agent: PerplexityBot
User-agent: CCBot
User-agent: Bytespider
User-agent: Amazonbot
User-agent: Meta-ExternalAgent
User-agent: Google-Extended
User-agent: Applebot-Extended
Disallow: /
Notice what is not on that list: Googlebot and Bingbot. You cannot opt out of Google AI Overviews or Bing Copilot without also leaving classic search, because those answer features are fed by the same index that ranks you. Google-Extended only governs Gemini training, not AI Overviews. Anyone who tells you otherwise is selling something.
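One way to sanity-check a recipe before deploying it is to run it through a parser and ask, bot by bot, what it would permit. The sketch below uses Python's standard-library `urllib.robotparser` on the text of Recipe 1; the `allowed` helper and the example.com URL are hypothetical. Treat the result as an approximation: the standard-library parser matches user-agent names by substring and takes the first matching group, which is not guaranteed to mirror how each operator's own crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Recipe 1 from above, as it would appear in robots.txt.
RECIPE_1 = """\
# Let the answer engines cite you
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
Allow: /

# Opt out of model training
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Meta-ExternalAgent
User-agent: Google-Extended
User-agent: Applebot-Extended
Disallow: /
"""


def allowed(robots_txt: str, bot: str,
            url: str = "https://example.com/any-page") -> bool:
    """Return True if `bot` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(bot, url)


if __name__ == "__main__":
    for bot in ("OAI-SearchBot", "PerplexityBot", "GPTBot",
                "ClaudeBot", "Google-Extended", "Googlebot"):
        verdict = "allowed" if allowed(RECIPE_1, bot) else "blocked"
        print(f"{bot}: {verdict}")
```

Bots that the file never names, such as Googlebot here, fall through to the default and are allowed, which is the behaviour Recipe 1 relies on.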
🚧 What this map does not mean
Honesty is the difference between a field guide and a fairy tale. Five caveats you should hold onto:
- robots.txt is a request, not a firewall. Compliance is voluntary. Independent monitoring (notably by Cloudflare in 2025) reported that some operators, Perplexity and ByteDance's Bytespider among the most cited, fetched pages that disallowed them. The asterisks in the table flag exactly those disputed cases. If you need a hard block, enforce it at the edge with WAF rules and verified IP ranges, not robots.txt alone.
- User-agents can be spoofed. Anyone can send a request claiming to be ClaudeBot. The real operators publish IP ranges (and some support reverse-DNS verification) so you can confirm a bot is genuine before trusting, or rate-limiting, it.
- Allowing a crawler is necessary, not sufficient. Letting OAI-SearchBot in does not guarantee a citation. It makes you eligible. Whether you're actually quoted still depends on relevance, clarity, and whether your content survives rendering, which is a separate problem entirely.
- The list changes. These operators add and rename agents regularly (the -SearchBot variants are relatively new). Treat any AI-crawler list, including this one, as a snapshot, and re-check the official docs before a migration.
- Blocking is reversible; recovering lost ground is slower. If you disallow a search crawler, re-allowing it doesn't instantly restore your presence, because the engine has to re-crawl and re-index on its own schedule.
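The spoofing caveat can be made concrete with a small IP check. The sketch below is an illustration under loud assumptions: `PUBLISHED_RANGES` and `is_verified` are hypothetical names, and the CIDR blocks are reserved documentation ranges (RFC 5737), not any operator's real addresses. In practice you would load the ranges each operator actually publishes and refresh them regularly.

```python
import ipaddress

# PLACEHOLDER ranges from the RFC 5737 documentation blocks.
# Replace with the ranges each operator publishes; these are NOT real bot IPs.
PUBLISHED_RANGES = {
    "ClaudeBot": ["192.0.2.0/24"],
    "GPTBot": ["198.51.100.0/24"],
}


def is_verified(bot: str, remote_ip: str) -> bool:
    """True if remote_ip falls inside a range listed for the claimed bot."""
    try:
        ip = ipaddress.ip_address(remote_ip)
    except ValueError:
        # Malformed address: treat as unverified rather than crash.
        return False
    return any(
        ip in ipaddress.ip_network(cidr)
        for cidr in PUBLISHED_RANGES.get(bot, [])
    )


if __name__ == "__main__":
    print(is_verified("ClaudeBot", "192.0.2.17"))   # inside the placeholder range
    print(is_verified("ClaudeBot", "203.0.113.5"))  # outside it
```

A request that claims a bot's name but fails a check like this is a candidate for rate-limiting or an edge block, independent of what robots.txt says.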
✅ Audit your own site in ten minutes
Don't take my table on faith; go look at what's actually happening on your server. The replication steps:
- Read your live robots.txt. Visit yourdomain.com/robots.txt and find every User-agent line. Map each one to the table above and ask: am I blocking a training bot (fine) or an answer bot (probably a mistake)?
- Grep your access logs for one week. Filter for the user-agents in the table. You'll see who actually visits, how often, and which ones you're serving 200 versus 403.
- Check for accidental edge blocks. Many sites block AI bots at Cloudflare or a WAF without realising it; the "Block AI Scrapers" toggle is one click. Confirm your answer crawlers aren't being stopped before they reach robots.txt.
- Test a live fetch. Ask ChatGPT or Claude to open one of your URLs. If it returns "couldn't access," your user-fetch agents are blocked somewhere in the stack.
- Decide on purpose. Write down, per operator, whether you want to be trained on, indexed for answers, or fetched live, then make robots.txt say exactly that. Default-by-accident is how good pages disappear.
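The log-grepping step can be scripted. The following Python sketch tallies requests per bot and HTTP status, assuming your server writes the common "combined" access-log format with the user-agent as the last quoted field; the `tally` function, the `BOTS` list, and the sample lines are hypothetical, so adjust the pattern to your own log layout.

```python
import re
from collections import Counter

# User-agent tokens from the table that actually appear in request headers.
BOTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
    "Claude-SearchBot", "Claude-User", "PerplexityBot", "Perplexity-User",
    "Googlebot", "Bingbot", "Applebot", "Amazonbot",
    "Meta-ExternalAgent", "CCBot", "Bytespider", "DuckAssistBot",
]

# Combined log format tail: "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(r'" (\d{3}) \S+ "[^"]*" "([^"]*)"\s*$')


def tally(lines):
    """Count (bot, status) pairs across an iterable of access-log lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # line is not in the assumed format
        status, user_agent = match.group(1), match.group(2).lower()
        for bot in BOTS:
            if bot.lower() in user_agent:
                counts[(bot, status)] += 1
                break
    return counts


if __name__ == "__main__":
    # Illustrative lines; in practice iterate over open("access.log").
    sample = [
        '203.0.113.9 - - [01/Jan/2026:00:00:01 +0000] "GET /guide HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
        '203.0.113.10 - - [01/Jan/2026:00:00:02 +0000] "GET /guide HTTP/1.1" 403 312 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    ]
    for (bot, status), n in sorted(tally(sample).items()):
        print(f"{bot} {status} {n}")
```

A 403 against an answer crawler in this tally, with no matching Disallow in robots.txt, is the usual sign of an edge rule doing the blocking.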
Claude Vincent is a technical SEO consultant focused on crawlability, rendering, and AI-search visibility. He writes the field guides and case studies at SEO ProCheck, with a bias toward the durable, unglamorous work that decides whether search engines and AI answer engines can actually read and cite a site.







