The AI Crawler Map: Every Bot Reading Your Site in 2026

Open your server logs for a single day and you will find a parade of robots you never invited: GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot, Bytespider, and a dozen more. Each one wants something different, each one reacts differently when you block it, and most site owners treat all of them as a single switch labelled "AI." That is exactly how good pages quietly vanish from AI answers while their owners congratulate themselves on "blocking the scrapers."

This is the map I wish every client had before they touched their robots.txt. It covers every major AI user-agent crawling the open web in 2026, what each one actually does, and the specific thing that breaks when you say no.

Figure: How a page becomes an AI citation, or doesn't

  1. A user asks an AI engine.
  2. The engine sends a fetch bot.
  3. Your robots.txt and server respond.
  4. The page is rendered and its text extracted.

  ✓ Allowed → you're cited in the answer
  ✕ Blocked / 403 → absent from the answer

Block the wrong user-agent and you remove yourself from the answer, even though the page still ranks in classic search.

🗺️ TL;DR

There is no single "AI bot." There are roughly twenty named user-agents doing three different jobs: training models, indexing pages to answer questions, and fetching a page live for one user. Blocking a training crawler costs you almost no traffic. Blocking a search/answer crawler removes you from AI citations entirely. Knowing which is which is the whole game.

  • ~20 named AI user-agents crawl the open web today
  • 3 jobs: every bot does exactly one (train, index-to-answer, or fetch-for-a-user)
  • 1 line in robots.txt can erase you from AI answers while search stays untouched

💡 The only mental model you need: three jobs

Every AI crawler on the planet is doing one of three jobs. Sort each bot into the right bucket and the entire "should I block this?" question answers itself.

  • Training crawlers harvest text to teach a model. They visit once in a while, take a copy, and leave. Blocking them protects your content from future model training but costs you essentially zero live visibility, because the model was never going to link back to you anyway.
  • Search & answer crawlers build the live index an assistant searches when a user asks a question. These are the bots that cite you. Block one and you remove yourself from that engine's answers, which is the AI equivalent of deleting yourself from Google's index.
  • User-fetch agents retrieve a single page in real time because a specific human asked the assistant to read that URL. Block them and "summarise this link" requests to your pages simply fail.

The expensive mistake is treating all three as the same threat. People who paste a "block all AI" snippet to stop model training also quietly delete themselves from ChatGPT Search, Claude, and Perplexity, which are the surfaces actually sending qualified readers in 2026.

🤖 The map: every major AI bot in 2026

Grouped by operator. "Honors robots.txt" reflects each operator's published policy; the asterisks matter, and I unpack them in the limitations section.

User-agent | Operator | Job | robots.txt | What blocking costs you
GPTBot | OpenAI | Train | Yes | Your text won't train GPT models
OAI-SearchBot | OpenAI | Index → answer | Yes | You disappear from ChatGPT Search citations
ChatGPT-User | OpenAI | User fetch | Yes | "Open this link" requests fail
ClaudeBot | Anthropic | Train | Yes | Your text won't train Claude
Claude-SearchBot | Anthropic | Index → answer | Yes | You're absent from Claude's web citations
Claude-User | Anthropic | User fetch | Yes | User-initiated fetches fail
PerplexityBot | Perplexity | Index → answer | Yes* | You vanish from Perplexity citations
Perplexity-User | Perplexity | User fetch | No* | Usually still fetches (user-initiated)
Googlebot | Google | Index → Search + AI Overviews | Yes | You leave Google Search entirely; AI Overviews can't be blocked alone
Google-Extended | Google | Token: Gemini training | Token | Opt out of Gemini training/grounding; Search unaffected
Bingbot | Microsoft | Index → Search + Copilot | Yes | You leave Bing and Copilot answers
Applebot | Apple | Index → Siri/Spotlight | Yes | Absent from Siri and Spotlight results
Applebot-Extended | Apple | Token: Apple Intelligence training | Token | Opt out of Apple AI training; Search unaffected
Amazonbot | Amazon | Index → Alexa answers | Yes | Absent from Alexa answers
Meta-ExternalAgent | Meta | Train | Yes | Your text won't train Meta AI
CCBot | Common Crawl | Train (open dataset) | Yes | Less inclusion in a dataset many labs train on
Bytespider | ByteDance | Train | No* | May keep crawling regardless
DuckAssistBot | DuckDuckGo | Index → DuckAssist | Yes | Absent from DuckAssist answers

A "Token" entry is a directive in robots.txt, not a crawler with its own user-agent. Google-Extended and Applebot-Extended govern training only and never affect how you rank in their search products.
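
To make the three-job sorting concrete, here is a minimal Python sketch that maps a User-Agent header to one of the jobs. The job labels come straight from the table above; the dictionary layout, the function name `classify_user_agent`, and the simplified sample header are illustrative assumptions, not an official registry. The two "Token" entries are left out because they never appear as a request header.

```python
# Job labels copied from the table above. The names BOT_JOBS and
# classify_user_agent are hypothetical helpers for illustration only.
BOT_JOBS = {
    "GPTBot": "train",
    "OAI-SearchBot": "index",
    "ChatGPT-User": "user-fetch",
    "ClaudeBot": "train",
    "Claude-SearchBot": "index",
    "Claude-User": "user-fetch",
    "PerplexityBot": "index",
    "Perplexity-User": "user-fetch",
    "Googlebot": "index",
    "Bingbot": "index",
    "Applebot": "index",
    "Amazonbot": "index",
    "Meta-ExternalAgent": "train",
    "CCBot": "train",
    "Bytespider": "train",
    "DuckAssistBot": "index",
}


def classify_user_agent(ua_header: str) -> str:
    """Return 'train', 'index', 'user-fetch', or 'unknown' for a User-Agent header."""
    ua = ua_header.lower()
    # Try longer tokens first so a more specific name wins over a shorter one.
    for token in sorted(BOT_JOBS, key=len, reverse=True):
        if token.lower() in ua:
            return BOT_JOBS[token]
    return "unknown"


if __name__ == "__main__":
    # Simplified header for illustration; real crawler headers are longer.
    print(classify_user_agent("Mozilla/5.0 (compatible; GPTBot/1.0)"))
```

Because any client can send any header, treat the result as a claim about who is visiting, not as proof.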

🚦 What actually breaks when you block each type

Here is the part the "block all AI" crowd never thinks through. The three jobs carry wildly different costs.

Block a training crawler (GPTBot, ClaudeBot, CCBot, Bytespider) → low cost

You keep your work out of future model training. You lose almost no live traffic, because training crawlers don't send readers; they don't even cite. This is the safe, defensible block if your concern is "don't train on my IP without asking." Pair it with the -Extended tokens to cover Gemini and Apple Intelligence.

Block a search/answer crawler (OAI-SearchBot, Claude-SearchBot, PerplexityBot) → high cost

This is self-sabotage dressed as caution. These bots build the index the assistant searches at answer time. Disallow them and you are simply not a candidate to be cited; your competitor who left them allowed gets the mention and the click. If you want to appear in AI answers, these must stay open.

Block a user-fetch agent (ChatGPT-User, Claude-User) → friction cost

These fire when a real person pastes your URL and asks the assistant to read it. Block them and that person gets "I couldn't access that page." It rarely moves rankings, but it is a bad look at the exact moment someone is trying to engage with your content directly.

🔧 robots.txt recipes you can paste today

Multiple User-agent lines can share one group of Allow or Disallow rules; that is valid syntax, not a hack.

Recipe 1: Stay in AI answers, opt out of model training. The configuration most publishers actually want:

# Let the answer engines cite you
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
Allow: /

# Opt out of model training
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Meta-ExternalAgent
User-agent: Google-Extended
User-agent: Applebot-Extended
Disallow: /
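
Before deploying a recipe like this, it can help to check how a parser reads it. The sketch below feeds Recipe 1 to `urllib.robotparser` from the Python standard library and reports which agents may fetch the site root. The helper name `allowed_agents` is an assumption for illustration, and Python's parser is only a stand-in: each crawler runs its own implementation, so a pass here is a sanity check rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser

# Recipe 1 from the article, copied verbatim.
RECIPE_1 = """\
# Let the answer engines cite you
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
Allow: /

# Opt out of model training
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Meta-ExternalAgent
User-agent: Google-Extended
User-agent: Applebot-Extended
Disallow: /
"""


def allowed_agents(robots_txt: str, agents: list[str], path: str = "/") -> dict[str, bool]:
    """Map each agent name to whether this robots.txt lets it fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, path) for agent in agents}


if __name__ == "__main__":
    report = allowed_agents(RECIPE_1, ["OAI-SearchBot", "GPTBot", "Googlebot"])
    for agent, ok in report.items():
        print(f"{agent}: {'allowed' if ok else 'blocked'}")
```

Agents that match no group, such as Googlebot here, fall through to the default, which is allowed.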

Recipe 2: Block everything AI and accept the cost. Legitimate for paywalled or licensing-sensitive sites, as long as you know you're leaving every AI answer surface:

User-agent: GPTBot
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: Claude-SearchBot
User-agent: Claude-User
User-agent: PerplexityBot
User-agent: CCBot
User-agent: Bytespider
User-agent: Amazonbot
User-agent: Meta-ExternalAgent
User-agent: Google-Extended
User-agent: Applebot-Extended
Disallow: /

Notice what is not on that list: Googlebot and Bingbot. You cannot opt out of Google AI Overviews or Bing Copilot without also leaving classic search, because those answer features are fed by the same index that ranks you. Google-Extended only governs Gemini training, not AI Overviews. Anyone who tells you otherwise is selling something.

🚧 What this map does not mean

Honesty is the difference between a field guide and a fairy tale. Five caveats you should hold onto:

  • robots.txt is a request, not a firewall. Compliance is voluntary. Independent monitoring (notably by Cloudflare in 2025) reported that some operators, Perplexity and ByteDance's Bytespider among the most cited, fetched pages that disallowed them. The asterisks in the table flag exactly those disputed cases. If you need a hard block, enforce it at the edge with WAF rules and verified IP ranges, not robots.txt alone.
  • User-agents can be spoofed. Anyone can send a request claiming to be ClaudeBot. The real operators publish IP ranges (and some support reverse-DNS verification) so you can confirm a bot is genuine before trusting, or rate-limiting, it.
  • Allowing a crawler is necessary, not sufficient. Letting OAI-SearchBot in does not guarantee a citation. It makes you eligible. Whether you're actually quoted still depends on relevance, clarity, and whether your content survives rendering, which is a separate problem entirely.
  • The list changes. These operators add and rename agents regularly (the -SearchBot variants are relatively new). Treat any AI-crawler list, including this one, as a snapshot, and re-check the official docs before a migration.
  • Blocking is reversible; lost ground comes back more slowly. If you disallow a search crawler, re-allowing it doesn't instantly restore your presence, because the engine has to re-crawl and re-index on its own schedule.
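
The spoofing caveat above can be sketched in a few lines of Python: trust a bot's claimed name only if the request IP falls inside a range its operator publishes. Everything specific here is an assumption. The CIDR blocks are reserved documentation ranges standing in for real published lists, and `HYPOTHETICAL_RANGES` and `is_verified_bot` are made-up names, so swap in each operator's actual ranges before relying on it.

```python
import ipaddress

# Placeholder ranges taken from the reserved documentation blocks.
# They are NOT real crawler ranges; replace them with the lists that
# each operator publishes.
HYPOTHETICAL_RANGES = {
    "ClaudeBot": ["192.0.2.0/24"],
    "GPTBot": ["198.51.100.0/24", "2001:db8::/32"],
}


def is_verified_bot(claimed_bot: str, remote_ip: str, ranges=HYPOTHETICAL_RANGES) -> bool:
    """Return True only if remote_ip sits inside a range listed for claimed_bot."""
    try:
        ip = ipaddress.ip_address(remote_ip)
    except ValueError:
        return False  # Not a parseable IP address, so never trusted.
    for cidr in ranges.get(claimed_bot, []):
        network = ipaddress.ip_network(cidr)
        # Compare only like with like: IPv4 against IPv4, IPv6 against IPv6.
        if ip.version == network.version and ip in network:
            return True
    return False


if __name__ == "__main__":
    print(is_verified_bot("ClaudeBot", "192.0.2.10"))
    print(is_verified_bot("ClaudeBot", "203.0.113.5"))
```

A request that claims a bot name but fails this check is a candidate for rate-limiting or an edge block, which is the kind of enforcement robots.txt alone cannot give you.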

✅ Audit your own site in ten minutes

Don't take my table on faith; go look at what's actually happening on your server. The replication steps:

  1. Read your live robots.txt. Visit yourdomain.com/robots.txt and find every User-agent line. Map each one to the table above and ask: am I blocking a training bot (fine) or an answer bot (probably a mistake)?
  2. Grep your access logs for one week. Filter for the user-agents in the table. You'll see who actually visits, how often, and which ones you're serving 200 versus 403.
  3. Check for accidental edge blocks. Many sites block AI bots at Cloudflare or a WAF without realising it; the "Block AI Scrapers" toggle is one click. Confirm your answer crawlers aren't being stopped before they reach robots.txt.
  4. Test a live fetch. Ask ChatGPT or Claude to open one of your URLs. If it returns "couldn't access," your user-fetch agents are blocked somewhere in the stack.
  5. Decide on purpose. Write down, per operator, whether you want to be trained on, indexed for answers, or fetched live, then make robots.txt say exactly that. Default-by-accident is how good pages disappear.
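
Step 2 can be scripted rather than grepped by hand. The sketch below counts hits per bot and HTTP status, assuming your server writes the common "combined" log format with the user-agent as the last quoted field. The regex, the `count_bot_hits` helper, and the sample lines are illustrative assumptions; adjust the pattern if your log layout differs.

```python
import re
from collections import Counter

# User-agent names from the table in this article.
AI_BOTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "Claude-SearchBot",
    "Claude-User", "PerplexityBot", "Perplexity-User", "Amazonbot",
    "Meta-ExternalAgent", "CCBot", "Bytespider", "DuckAssistBot",
]

# Assumed layout: ... "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(r'"[^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"\s*$')


def count_bot_hits(lines):
    """Count (bot, status) pairs for log lines whose user-agent names an AI bot."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # Skip lines that don't fit the assumed format.
        ua = match.group("ua").lower()
        for bot in AI_BOTS:
            if bot.lower() in ua:
                counts[(bot, match.group("status"))] += 1
                break
    return counts


# Made-up sample lines using documentation IP addresses.
SAMPLE_LINES = [
    '203.0.113.7 - - [01/Jan/2026:00:00:01 +0000] "GET /guide HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.8 - - [01/Jan/2026:00:00:02 +0000] "GET /guide HTTP/1.1" 403 312 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '203.0.113.9 - - [01/Jan/2026:00:00:03 +0000] "GET / HTTP/1.1" 200 8000 "-" "Mozilla/5.0 (Windows NT 10.0) Firefox/130.0"',
]

if __name__ == "__main__":
    for (bot, status), hits in sorted(count_bot_hits(SAMPLE_LINES).items()):
        print(f"{bot} {status}: {hits}")
```

A 403 next to an answer crawler is the pattern to chase first, since it usually points to an edge rule rather than robots.txt.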

Claude Vincent is a technical SEO consultant focused on crawlability, rendering, and AI-search visibility. He writes the field guides and case studies at SEO ProCheck, with a bias toward the durable, unglamorous work that decides whether search engines and AI answer engines can actually read and cite a site.
