The Machine-Readable Web: What robots.txt, llms.txt, Schema, WebMCP and Entity Signals Actually Do in 2026
- June 1, 2026
- AI Search
There has never been more advice about "files you must add so AI can read your site." robots.txt, llms.txt, llms-full.txt, Schema, WebMCP, AI opt-out tokens: every week someone sells one of them as the thing standing between you and visibility. Most of that is noise. This is the honest map of every standard that governs how crawlers, search engines, and AI agents read your site in 2026, what each one actually does, how widely it is really used, and which ones move the needle versus which ones are optional or still experimental. No hype, no fear-selling.
Use this as the index. Each standard gets a plain-English summary here and a link to the full reference when you want the depth.
The standards that reliably matter are the boring, well-supported ones: robots.txt for access, meta robots for indexing, Schema and entity signals for meaning. llms.txt is on roughly one in ten sites and the major AI crawlers largely ignore it for content, so treat it as optional and mostly agent-facing. WebMCP is genuinely interesting but in early browser preview as of 2026, so it is a watch-and-experiment item, not a requirement. Optimise the rendered HTML and your entity clarity first. Everything else is a layer on top.
- ~10% of sites have llms.txt, and AI crawlers largely skip it for content (SE Ranking, 300k domains).
- Confirmed: Microsoft stated at SMX 2025 that schema markup helps its LLMs understand content.
- Feb 2026: WebMCP entered early browser preview (Chrome 145), so it is emerging, not established.
Every guide in this series, in one place. Start anywhere.
Access: who is allowed to read you
- The Complete robots.txt Reference: Precedence, Wildcards, AI Bots & Real-World Receipts
- The AI Crawler Map: Every Bot Reading Your Site in 2026
- Which AI Bots Are You Actually Blocking? (GPTBot, ClaudeBot, Perplexity & More)
- Is Your Cloudflare or WAF Secretly Blocking GPTBot?
Discovery: telling AI what exists
Meaning: what your content means and who you are
- The Forgotten HTML: What AI Crawlers Really See on Your Expensive Website
- SSR vs CSR: Why Rendering Decides Whether AI Can Read Your Site
- Structuring Content So AI Can Actually Extract It
- Entity SEO: Helping Search Engines and AI Understand Who You Are
- Entity HTML: How to Become a Machine-Readable Brand (Knowledge Graph, sameAs & the Entitymap)
- FAQ Schema: Why You Still Keep It After Google Retired the Rich Result
- Schema Markup for AI Search: FAQPage, Tables, and Structured Data
Action: what an agent can do on your site
AI engines and getting cited
- Google AI Overviews: How to Optimize for AI Overview Citations
- How to Become a Cited Source in AI Answers
- ChatGPT SEO: How to Optimize for ChatGPT and SearchGPT Citations
- Claude SEO: How to Get Cited in Claude AI
- Perplexity SEO: How to Get Cited in Perplexity AI
- Brand Mentions vs Backlinks: What Matters More for AI Visibility
- How AI Search Tools Source Information from Organic Search Results
- JavaScript and AI Search: Why Server-Side Rendering Matters for GEO
📊 Every standard at a glance
| Standard | What it does | Audience | Reliability | Adoption | Verdict |
|---|---|---|---|---|---|
| robots.txt | Controls which paths a crawler may request | All crawlers | Voluntary, widely honoured | Universal | Essential |
| Meta robots / X-Robots-Tag | Controls indexing (the real noindex) | Search engines | Authoritative | Universal | Essential |
| XML sitemap | Lists URLs for discovery | Search crawlers | Well supported | Universal | Useful |
| Schema.org structured data | States what content means | Search and AI | Used; some rich results restricted | Widespread | High value |
| Entity signals (sameAs, KG) | Defines who you are as a thing | Search and AI | Foundational, confidence-based | Growing | High value, rising |
| AI opt-out tokens (Google-Extended, etc.) | Opt out of AI model training | Specific AI vendors | Honoured by issuer | As needed | Use if opting out |
| llms.txt | Proposed index of key content for AI | AI agents, not search | Crawlers largely skip it; Google Search says not needed | ~10% | Optional |
| llms-full.txt | Full content concatenated for LLMs | AI agents | Same caveats, rarer | Minimal | Optional |
| WebMCP | Exposes site actions as callable tools for agents | Browser AI agents | Early preview | Minimal | Emerging, watch |
🔓 Layer 1: access (robots.txt and AI tokens)
What it does: robots.txt tells compliant crawlers which paths they may request. AI opt-out tokens like Google-Extended and Applebot-Extended let you decline model training without leaving search. Reliability: high among mainstream crawlers, though compliance is voluntary and some AI bots have been observed ignoring it.
The honest take: this is genuinely essential, and the most common mistake is blocking the wrong thing, for example removing yourself from AI answers by accident. Full detail in the complete robots.txt reference and the per-bot breakdown in the AI Crawler Map.
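To make the "blocking the wrong thing" risk concrete, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The robots.txt content, the example.com URLs, and the `access_report` helper are all hypothetical, and the parser only models how a compliant crawler would read the rules. It says nothing about bots that ignore robots.txt, and Google-Extended is a control token rather than a separate fetching crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: admin area closed to everyone,
# and Google's AI-training token opted out of the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

User-agent: Google-Extended
Disallow: /
"""

# Tokens named in this guide; extend the list with the bots you care about.
AGENTS = ["Googlebot", "GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended"]


def access_report(robots_txt: str, url: str, agents=AGENTS) -> dict:
    """Return {user-agent token: True/False} for one URL under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    report = access_report(ROBOTS_TXT, "https://example.com/pricing")
    for agent, allowed in report.items():
        print(f"{agent:16} {'allowed' if allowed else 'BLOCKED'}")
```

Running the same report against your own robots.txt and a handful of important URLs is a quick way to spot an AI answer crawler you shut out without meaning to.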
🗂️ Layer 2: discovery (sitemaps and llms.txt)
XML sitemaps list your URLs so search engines can find them. Boring, universal, worth having. llms.txt is the controversial one: a proposed Markdown file that points AI systems at your key content. Reliability and adoption: here is the no-BS part. As of 2026 it sits on roughly one in ten sites, the major AI crawlers overwhelmingly fetch your HTML directly rather than the file, and Google Search states plainly that it is not needed for AI Overviews or AI Mode. At the same time, Chrome's Lighthouse now checks for it under an agentic-browsing audit, and Anthropic and OpenAI reference it for their agent tooling. So yes: one part of Google now checks for a file another part of Google says you do not need. That is a little WTF, and if you find it confusing, the confusion is on them, not you.
The honest take: it is not "robots.txt for AI," it cannot block anything, and it will not move search visibility. It may help agent and developer-documentation use cases. Treat it as optional. The full, receipt-backed story is in llms.txt explained.
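If you do decide to publish one, the file itself is simple. The sketch below renders the shape described in the llms.txt proposal: an H1 title, a blockquote summary, then H2 sections of Markdown links. The `render_llms_txt` helper, the site name, and every URL are made up for illustration, and nothing here implies that any crawler will read the result.

```python
def render_llms_txt(site_name: str, summary: str, sections: dict) -> str:
    """Render a minimal llms.txt body.

    sections maps a heading to a list of (title, url, note) tuples.
    """
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url, note in links:
            lines.append(f"- [{title}]({url}): {note}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    # Hypothetical site and pages.
    print(render_llms_txt(
        "Example Co",
        "Documentation and pricing for the Example Co API.",
        {
            "Docs": [
                ("Quick start", "https://example.com/docs/quickstart.md", "first request in five minutes"),
                ("API reference", "https://example.com/docs/api.md", "endpoints and parameters"),
            ],
            "Optional": [
                ("Changelog", "https://example.com/changelog.md", "release history"),
            ],
        },
    ))
```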
🧠 Layer 3: meaning (Schema and entity signals)
This is where the durable advantage lives, and where most sites underinvest. Schema.org structured data states what your content is in a vocabulary machines share. Microsoft confirmed at SMX 2025 that schema helps its LLMs understand content, and structured data underpins rich results and answer eligibility, even though Google restricted some rich results such as FAQ to government and health sites in 2023. Entity signals go further: they define your brand as a thing in the knowledge graph through Organization schema, sameAs corroboration, and a clear entity home.
The honest take: being a confident, well-described entity is increasingly the prerequisite for being cited at all. Start here. Depth lives in Entity HTML, the rendering angle in The Forgotten HTML, and the schema-specific case in FAQ schema, why you still keep it.
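As a starting point, here is a minimal sketch of Organization markup with `sameAs` corroboration, generated as JSON-LD. The property names come from Schema.org; the `organization_jsonld` helper, the `#organization` identifier convention, and all names and profile URLs are assumptions for illustration. Swap in your own entity home and only the profiles that really are you.

```python
import json


def organization_jsonld(name: str, url: str, logo: str, same_as: list) -> str:
    """Build an Organization JSON-LD script tag for the entity home page."""
    data = {
        "@context": "https://schema.org",
        "@type": "Organization",
        # A stable identifier other markup on the site can point at (a common convention).
        "@id": f"{url.rstrip('/')}/#organization",
        "name": name,
        "url": url,
        "logo": logo,
        # Profiles that corroborate the same entity elsewhere on the web.
        "sameAs": sorted(set(same_as)),
    }
    body = json.dumps(data, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'


if __name__ == "__main__":
    # Hypothetical brand and profiles.
    print(organization_jsonld(
        name="Example Co",
        url="https://example.com/",
        logo="https://example.com/logo.png",
        same_as=[
            "https://www.linkedin.com/company/example-co",
            "https://github.com/example-co",
        ],
    ))
```

The tag belongs in the server-rendered HTML of the entity home, so it is present even for crawlers that never run JavaScript.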
🤝 Layer 4: action (WebMCP)
What it does: WebMCP, the Web Model Context Protocol, is a browser-native API developed by Google and Microsoft engineers and incubated in a W3C community group. It lets a site publish a "tool contract," a structured list of actions an AI agent can call directly, so the agent books, searches, or filters by calling your functions instead of guessing at buttons. Reliability and adoption: it entered early preview in February 2026 in Chrome 145, so it is experimental and barely deployed.
The honest take: this is the most forward-looking item on the list and worth understanding now, because the agentic web is where the browser vendors are clearly heading. It is not something you are behind on today. Watch it, prototype if you are technical, and do not let anyone tell you it is make-or-break in 2026. Read the full WebMCP explainer, with working code, here.
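To show what a "tool contract" means in practice, here is a language-neutral sketch of the idea in Python. To be clear, this is not the WebMCP API, which lives in the browser and is still changing. The `Tool` and `ToolContract` classes, the catalogue, and the `search_products` action are all hypothetical. The point is only the shape: named actions, described parameters, and a handler the agent calls instead of clicking through the page.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Tool:
    """One callable action: a name, a description, typed parameters, a handler."""
    name: str
    description: str
    params: dict  # parameter name -> expected Python type
    handler: Callable[..., Any]


class ToolContract:
    """Hypothetical registry standing in for the list of tools a site publishes."""

    def __init__(self) -> None:
        self._tools: dict = {}

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def describe(self) -> list:
        # What an agent would read before deciding which tool to call.
        return [
            {"name": t.name, "description": t.description,
             "params": {k: v.__name__ for k, v in t.params.items()}}
            for t in self._tools.values()
        ]

    def call(self, name: str, **kwargs: Any) -> Any:
        tool = self._tools[name]  # KeyError if the agent asks for an unknown tool
        if set(kwargs) != set(tool.params):
            raise TypeError(f"{name} expects parameters {sorted(tool.params)}")
        for key, expected in tool.params.items():
            if not isinstance(kwargs[key], expected):
                raise TypeError(f"{key} must be {expected.__name__}")
        return tool.handler(**kwargs)


# Hypothetical catalogue and search action for an imaginary shop.
CATALOGUE = [("desk lamp", 35.0), ("floor lamp", 120.0), ("desk mat", 18.0)]


def search_products(query: str, max_price: float) -> list:
    return [name for name, price in CATALOGUE if query in name and price <= max_price]


contract = ToolContract()
contract.register(Tool(
    name="search_products",
    description="Search the catalogue by keyword and maximum price.",
    params={"query": str, "max_price": float},
    handler=search_products,
))

if __name__ == "__main__":
    print(contract.describe())
    print(contract.call("search_products", query="lamp", max_price=50.0))
```

The agent never parses your layout here: it reads the description, supplies typed arguments, and gets structured data back. That is the trade WebMCP proposes, whatever the final browser API ends up looking like.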
✅ The no-BS verdict: what to actually do
If you do nothing else, do these, in order:
- Get access right. Publish a correct robots.txt that lets in the engines you want, including AI answer crawlers, and use meta robots for real indexing control.
- Be readable. Make sure your content survives rendering, because every layer above is wasted if the crawler sees an empty shell.
- Be a clear entity. Organization schema, sameAs, a clean entity home. This is the highest-leverage, most under-done work.
- Add structured data where it is genuine. Schema for real content, never faked to chase a feature.
- Treat llms.txt and WebMCP as optional and experimental. Add llms.txt if agent or developer audiences matter to you. Watch WebMCP. Neither is a search-visibility lever today.
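The first item in the list above separates crawling from indexing, and the indexing half is easy to check. The sketch below collects robots directives from an `X-Robots-Tag` header and from `<meta name="robots">` tags, then reports whether a page is effectively noindexed. It is deliberately simplified: the `indexing_directives` and `is_noindexed` helpers are my own names, bot-specific forms such as `googlebot: noindex` are not handled, and it works on headers and HTML you pass in rather than fetching anything.

```python
from html.parser import HTMLParser


def _split(value: str) -> set:
    """Split a comma-separated directive string into lowercase tokens."""
    return {part.strip().lower() for part in value.split(",") if part.strip()}


class _MetaRobots(HTMLParser):
    """Collects directives from every <meta name="robots" content="..."> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() == "robots":
            self.directives |= _split(attr.get("content", ""))


def indexing_directives(headers: dict, html: str) -> set:
    """Merge directives from the X-Robots-Tag header and meta robots tags."""
    found = set()
    for key, value in headers.items():
        if key.lower() == "x-robots-tag":
            found |= _split(value)
    parser = _MetaRobots()
    parser.feed(html)
    return found | parser.directives


def is_noindexed(headers: dict, html: str) -> bool:
    # "none" is shorthand for "noindex, nofollow".
    return bool(indexing_directives(headers, html) & {"noindex", "none"})


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
    print(indexing_directives({}, page))
    print(is_noindexed({}, page))
```

One practical consequence: a crawler has to be able to fetch the page to see either signal, so a URL that is disallowed in robots.txt cannot reliably communicate a noindex.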
The pattern across all of it: the durable wins are the unglamorous, well-supported standards plus a clear identity. The new files get the headlines, but they sit on top of fundamentals that most sites still have not finished. If there is any real BS in this space, it is the constant pressure to adopt a shiny new file before the boring, proven basics are even in place.
One caveat worth stating plainly: this space moves fast, and the vendors themselves say the standards for the agentic web are still emerging. Nothing here is settled or "dead." Treat every verdict as current as of 2026 and re-check the primary sources before a major decision.
An advanced audit cuts through the hype: it checks your access, rendering, schema, entity signals, and AI readiness, and hands you a prioritised list of what to fix, with the receipts.
Claude Vincent is a technical SEO consultant focused on crawlability, rendering, and AI-search visibility. He writes the field guides and case studies at SEO ProCheck, with a bias toward the durable, unglamorous work that decides whether search engines and AI answer engines can actually read and cite a site.







