The AI Crawler Map: Every Bot Reading Your Site in 2026

May 28, 2026
AI Search


Open your server logs for a single day and you will find a parade of robots you never invited: GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot, Bytespider, and a dozen more. Each one wants something different, each one reacts differently when you block it, and most site owners treat all of them as a single switch labelled "AI." That is exactly how good pages quietly vanish from AI answers while their owners congratulate themselves on "blocking the scrapers."

This is the map I wish every client had before they touched their robots.txt. It covers every major AI user-agent crawling the open web in 2026, what each one actually does, and the specific thing that breaks when you say no.

How a page becomes an AI citation, or doesn't

A user asks an AI engine → The engine sends a fetch bot → robots.txt + server respond → Page rendered, text extracted

✓ Allowed → you're cited in the answer

✕ Blocked / 403 → absent from the answer

Block the wrong user-agent and you remove yourself from the answer, even though the page still ranks in classic search.

🗺️ TL;DR

There is no single "AI bot." There are roughly twenty named user-agents doing three different jobs: training models, indexing pages to answer questions, and fetching a page live for one user. Blocking a training crawler costs you almost no traffic. Blocking a search/answer crawler removes you from AI citations entirely. Knowing which is which is the whole game.

~20 named AI user-agents crawl the open web today

3 jobs: every bot does exactly one (train, index-to-answer, or fetch-for-a-user)

1 line in robots.txt can erase you from AI answers while search stays untouched

💡 The only mental model you need: three jobs

Every AI crawler on the planet is doing one of three jobs. Sort each bot into the right bucket and the entire "should I block this?" question answers itself.

Training crawlers harvest text to teach a model. They visit once in a while, take a copy, and leave. Blocking them protects your content from future model training but costs you essentially zero live visibility, because the model was never going to link back to you anyway.
Search & answer crawlers build the live index an assistant searches when a user asks a question. These are the bots that cite you. Block one and you remove yourself from that engine's answers: the AI equivalent of deleting yourself from Google's index.
User-fetch agents retrieve a single page in real time because a specific human asked the assistant to read that URL. Block them and "summarise this link" requests to your pages simply fail.

The expensive mistake is treating all three as the same threat. People who paste a "block all AI" snippet to stop model training also quietly delete themselves from ChatGPT Search, Claude, and Perplexity, the surfaces actually sending qualified readers in 2026.

🤖 The map: every major AI bot in 2026

Grouped by operator. "Honors robots.txt" reflects each operator's published policy; the asterisks matter, and I unpack them in the limitations section.

User-agent	Operator	Job	robots.txt	What blocking costs you
`GPTBot`	OpenAI	Train	Yes	Your text won't train GPT models
`OAI-SearchBot`	OpenAI	Index → answer	Yes	You disappear from ChatGPT Search citations
`ChatGPT-User`	OpenAI	User fetch	Yes	"Open this link" requests fail
`ClaudeBot`	Anthropic	Train	Yes	Your text won't train Claude
`Claude-SearchBot`	Anthropic	Index → answer	Yes	You're absent from Claude's web citations
`Claude-User`	Anthropic	User fetch	Yes	User-initiated fetches fail
`PerplexityBot`	Perplexity	Index → answer	Yes*	You vanish from Perplexity citations
`Perplexity-User`	Perplexity	User fetch	No*	Usually still fetches (user-initiated)
`Googlebot`	Google	Index → Search + AI Overviews	Yes	You leave Google Search entirely; AI Overviews can't be blocked alone
`Google-Extended`	Google	Token: Gemini training	Token	Opt out of Gemini training/grounding; Search unaffected
`Bingbot`	Microsoft	Index → Search + Copilot	Yes	You leave Bing and Copilot answers
`Applebot`	Apple	Index → Siri/Spotlight	Yes	Absent from Siri and Spotlight results
`Applebot-Extended`	Apple	Token: Apple Intelligence training	Token	Opt out of Apple AI training; Search unaffected
`Amazonbot`	Amazon	Index → Alexa answers	Yes	Absent from Alexa answers
`Meta-ExternalAgent`	Meta	Train	Yes	Your text won't train Meta AI
`CCBot`	Common Crawl	Train (open dataset)	Yes	Less inclusion in a dataset many labs train on
`Bytespider`	ByteDance	Train	No*	May keep crawling regardless
`DuckAssistBot`	DuckDuckGo	Index → DuckAssist	Yes	Absent from DuckAssist answers

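A "Token" entry is a name you address in robots.txt, not a crawler that fetches pages under its own user-agent. Google-Extended and Applebot-Extended control how content already crawled by Googlebot and Applebot is used for AI (training and, for Google-Extended, grounding); they never affect how you rank in those companies' search products.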
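To make the table usable against real traffic, here is a minimal Python sketch that maps a raw User-Agent header to one of the three jobs. The agent names and job labels are copied from the table; the `AI_AGENTS` dictionary, the `classify` function, and the sample header string are illustrative assumptions. The general search crawlers (Googlebot, Bingbot, Applebot) and the two tokens are deliberately left out, since the tokens never appear in a request header.

```python
from typing import Optional

# Job buckets copied from the table above. Assumption: matching the agent
# name as a case-insensitive substring of the header is good enough here.
AI_AGENTS = {
    "GPTBot": "train",
    "OAI-SearchBot": "index",
    "ChatGPT-User": "user-fetch",
    "ClaudeBot": "train",
    "Claude-SearchBot": "index",
    "Claude-User": "user-fetch",
    "PerplexityBot": "index",
    "Perplexity-User": "user-fetch",
    "Amazonbot": "index",
    "Meta-ExternalAgent": "train",
    "CCBot": "train",
    "Bytespider": "train",
    "DuckAssistBot": "index",
}


def classify(user_agent_header: str) -> Optional[tuple[str, str]]:
    """Return (agent, job) for the first known agent in the header, else None."""
    header = user_agent_header.lower()
    for agent, job in AI_AGENTS.items():
        if agent.lower() in header:
            return agent, job
    return None


# Illustrative header, not a verbatim copy of any operator's string.
print(classify("Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"))
```

A header match only tells you what a request claims to be; it says nothing about whether the claim is genuine.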

🚦 What actually breaks when you block each type

Here is the part the "block all AI" crowd never thinks through. The three jobs carry wildly different costs.

Block a training crawler (GPTBot, ClaudeBot, CCBot, Bytespider) → low cost

You keep your work out of future model training. You lose almost no live traffic, because training crawlers neither send readers nor cite sources. This is the safe, defensible block if your concern is "don't train on my IP without asking." Pair it with the -Extended tokens to cover Gemini and Apple Intelligence.

Block a search/answer crawler (OAI-SearchBot, Claude-SearchBot, PerplexityBot) → high cost

This is self-sabotage dressed as caution. These bots build the index the assistant searches at answer time. Disallow them and you are simply not a candidate to be cited; your competitor who left them allowed gets the mention and the click. If you want to appear in AI answers, these must stay open.

Block a user-fetch agent (ChatGPT-User, Claude-User) → friction cost

These fire when a real person pastes your URL and asks the assistant to read it. Block them and that person gets "I couldn't access that page." It rarely moves rankings, but it is a bad look at the exact moment someone is trying to engage with your content directly.

🔧 robots.txt recipes you can paste today

Multiple User-agent lines can share one Disallow block; that is valid syntax, not a hack.
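As a rough sketch of that grouping rule, the Python helper below turns a per-agent decision into two shared groups, one allowed and one disallowed. `build_robots_txt` and the `policy` mapping are hypothetical names invented for this illustration, not part of any library, and the sketch only handles whole-site rules.

```python
def build_robots_txt(policy: dict[str, bool]) -> str:
    """Render a robots.txt with one shared Allow group and one shared Disallow group.

    `policy` maps a user-agent name to True (allow the whole site)
    or False (disallow the whole site).
    """
    allowed = [agent for agent, ok in policy.items() if ok]
    blocked = [agent for agent, ok in policy.items() if not ok]

    groups = []
    if allowed:
        groups.append([f"User-agent: {a}" for a in allowed] + ["Allow: /"])
    if blocked:
        groups.append([f"User-agent: {a}" for a in blocked] + ["Disallow: /"])

    # Groups are separated by a blank line; the file ends with a newline.
    return "\n\n".join("\n".join(group) for group in groups) + "\n"


print(build_robots_txt({"OAI-SearchBot": True, "GPTBot": False, "CCBot": False}))
```

Keeping the decision in one mapping makes it harder to block an answer crawler by accident while editing the file by hand.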

Recipe 1: Stay in AI answers, opt out of model training. The configuration most publishers actually want:

# Let the answer engines cite you
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
Allow: /

# Opt out of model training
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Meta-ExternalAgent
User-agent: Google-Extended
User-agent: Applebot-Extended
Disallow: /

Recipe 2: Block everything AI, accept the cost. Legitimate for paywalled or licensing-sensitive sites, as long as you know you're leaving every AI answer surface:

# Every AI-specific agent from the table; Googlebot, Bingbot, and Applebot are left alone
User-agent: GPTBot
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: Claude-SearchBot
User-agent: Claude-User
User-agent: PerplexityBot
User-agent: Perplexity-User
User-agent: CCBot
User-agent: Bytespider
User-agent: Amazonbot
User-agent: DuckAssistBot
User-agent: Meta-ExternalAgent
User-agent: Google-Extended
User-agent: Applebot-Extended
Disallow: /

Notice what is not on that list: Googlebot and Bingbot. You cannot use robots.txt to opt out of Google AI Overviews or Bing Copilot without also leaving classic search, because those answer features are fed by the same index that ranks you. Google-Extended only governs Gemini training, not AI Overviews. Anyone who tells you otherwise is selling something.
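Before deploying a recipe, it can be sanity-checked locally. The sketch below feeds Recipe 1 to Python's standard `urllib.robotparser` and asks which agents may fetch a page. The `allowed_agents` helper and the example.com URL are assumptions for illustration. Python's parser matches agent names by substring and applies the first matching rule, so treat its verdict as an approximation of how any particular crawler will read the file.

```python
from urllib.robotparser import RobotFileParser

RECIPE_1 = """\
# Let the answer engines cite you
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
Allow: /

# Opt out of model training
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Meta-ExternalAgent
User-agent: Google-Extended
User-agent: Applebot-Extended
Disallow: /
"""


def allowed_agents(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Parse robots.txt text and report, per agent, whether `url` may be fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# Placeholder URL; swap in a real page from your own site.
report = allowed_agents(
    RECIPE_1,
    ["OAI-SearchBot", "GPTBot", "Googlebot"],
    "https://example.com/post",
)
print(report)  # {'OAI-SearchBot': True, 'GPTBot': False, 'Googlebot': True}
```

Googlebot comes back as allowed because no group names it and there is no `User-agent: *` fallback, which matches the point that classic search is untouched by this recipe.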

🚧 What this map does not mean

Honesty is the difference between a field guide and a fairy tale. Five caveats you should hold onto:

robots.txt is a request, not a firewall. Compliance is voluntary. Independent monitoring (notably by Cloudflare in 2025) reported that some operators, Perplexity and ByteDance's Bytespider among the most cited, fetched pages that disallowed them. The asterisks in the table flag exactly those disputed cases. If you need a hard block, enforce it at the edge with WAF rules and verified IP ranges, not robots.txt alone.
User-agents can be spoofed. Anyone can send a request claiming to be ClaudeBot. The real operators publish IP ranges (and some support reverse-DNS verification) so you can confirm a bot is genuine before trusting, or rate-limiting, it.
Allowing a crawler is necessary, not sufficient. Letting OAI-SearchBot in does not guarantee a citation. It makes you eligible. Whether you're actually quoted still depends on relevance, clarity, and whether your content survives rendering, which is a separate problem entirely.
The list changes. These operators add and rename agents regularly (the -SearchBot variants are relatively new). Treat any AI-crawler list, including this one, as a snapshot, and re-check the official docs before a migration.
Blocking is reversible; lost ground is slower to recover. If you disallow a search crawler, re-allowing it doesn't instantly restore your presence, because the engine has to re-crawl and re-index on its own schedule.
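The spoofing caveat above is the easiest one to act on in code. This minimal Python sketch checks whether a request's IP address falls inside the ranges an operator publishes, using only the standard `ipaddress` module. The `PUBLISHED_RANGES` table, the `ExampleBot` name, and the `is_verified` function are hypothetical; the CIDR blocks are reserved documentation ranges, not any real operator's addresses, so load the real lists from each operator's documentation.

```python
import ipaddress

# Placeholder data: RFC 5737 documentation ranges, NOT real crawler ranges.
# In practice, fill this from the lists each operator publishes.
PUBLISHED_RANGES = {
    "ExampleBot": ["192.0.2.0/24", "198.51.100.0/25"],
}


def is_verified(agent: str, remote_ip: str) -> bool:
    """True if remote_ip sits inside one of the ranges published for agent."""
    ip = ipaddress.ip_address(remote_ip)
    networks = (
        ipaddress.ip_network(cidr) for cidr in PUBLISHED_RANGES.get(agent, [])
    )
    return any(ip in network for network in networks)


print(is_verified("ExampleBot", "192.0.2.17"))      # True: inside 192.0.2.0/24
print(is_verified("ExampleBot", "198.51.100.200"))  # False: outside the /25
```

A request that claims a bot's name but fails this check is a candidate for rate-limiting or blocking at the edge.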

✅ Audit your own site in ten minutes

Don't take my table on faith; go look at what's actually happening on your server. The replication steps:

Read your live robots.txt. Visit yourdomain.com/robots.txt and find every User-agent line. Map each one to the table above and ask: am I blocking a training bot (fine) or an answer bot (probably a mistake)?
Grep your access logs for one week. Filter for the user-agents in the table. You'll see who actually visits, how often, and which ones you're serving 200 versus 403.
Check for accidental edge blocks. Many sites block AI bots at Cloudflare or a WAF without realising it; the "Block AI Scrapers" toggle is one click. Confirm your answer crawlers aren't being stopped before they reach robots.txt.
Test a live fetch. Ask ChatGPT or Claude to open one of your URLs. If it returns "couldn't access," your user-fetch agents are blocked somewhere in the stack.
Decide on purpose. Write down, per operator, whether you want to be trained on, indexed for answers, or fetched live, then make robots.txt say exactly that. Default-by-accident is how good pages disappear.
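The log step in the list above can be sketched in a few lines of Python. This version assumes your server writes the common "combined" access-log format (request, status, bytes, referer, user-agent); the `tally` function, the shortened `AGENTS` list, and the sample lines are all illustrative, so adjust the pattern to your own log layout.

```python
import re
from collections import Counter

# A subset of the table's agents; extend as needed.
AGENTS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
          "Claude-SearchBot", "Claude-User", "PerplexityBot", "Bytespider"]

# Combined log format tail: "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(r'"[^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"')


def tally(lines):
    """Count requests per (agent, HTTP status) for the known AI agents."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # not in the assumed format
        ua = match.group("ua").lower()
        for agent in AGENTS:
            if agent.lower() in ua:
                counts[(agent, match.group("status"))] += 1
                break
    return counts


# Made-up sample lines; real user-agent strings are longer.
SAMPLE = [
    '203.0.113.9 - - [28/May/2026:10:00:00 +0000] "GET /post HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.2)"',
    '203.0.113.9 - - [28/May/2026:10:00:05 +0000] "GET /faq HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; GPTBot/1.2)"',
    '198.51.100.7 - - [28/May/2026:10:01:00 +0000] "GET / HTTP/1.1" 403 0 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '192.0.2.44 - - [28/May/2026:10:02:00 +0000] "GET /about HTTP/1.1" 200 812 "https://example.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]

for (agent, status), n in sorted(tally(SAMPLE).items()):
    print(agent, status, n)
```

In the sample, the 403 served to OAI-SearchBot is the pattern to look for: an answer crawler being turned away somewhere in the stack.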

Not sure which bots your site is quietly turning away?

An AI-visibility audit maps your real crawler access (robots.txt, edge rules, and rendering) against the answer engines that matter, and tells you exactly what to change.

Request an advanced SEO & AI-visibility audit →


Claude Vincent

Claude Vincent is a technical SEO consultant focused on crawlability, rendering, and AI-search visibility. He writes the field guides and case studies at SEO ProCheck, with a bias toward the durable, unglamorous work that decides whether search engines and AI answer engines can actually read and cite a site.
