The Forgotten HTML: What AI Crawlers Really See on Your Expensive Website


You can tell when a website cost a fortune. The type is set just right, the animations are buttery, every pixel is in its place. Then you fetch that same page the way an AI crawler does — raw HTML, no JavaScript, no negotiation — and half the time you get nothing. An empty shell. A loading spinner. Or a door slammed in your face with a 403.

This is the forgotten HTML: the version of your site that machines actually receive, which almost nobody on the team has ever looked at. The polish is for human eyes. The bots get something else entirely — and the bots are increasingly who decide whether you exist in an AI answer.

The uncomfortable truth

Most AI crawlers do not run JavaScript. Googlebot renders (slowly, expensively); almost nothing else does. In its analysis of over 500 million GPTBot fetches, Vercel found zero evidence of JavaScript execution (Vercel: The rise of the AI crawler). GPTBot, ClaudeBot and PerplexityBot grab the raw HTML and move on. So if your content only appears after JavaScript runs in a browser — or if your server never lets the bot in at all — you are invisible to the systems answering billions of questions a month.
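To make "grab the raw HTML and move on" concrete, here is a minimal sketch of what a non-rendering fetcher is left with. The two HTML strings and the `visibleWords` helper are made up for illustration, and the regex-based stripping is only a rough stand-in for whatever parser a real crawler uses.

```javascript
// Rough approximation of the text a non-rendering crawler can extract.
// Assumption: regex stripping is good enough for a sanity check; it is not a real HTML parser.
function visibleWords(html) {
  const text = html
    .replace(/<(script|style|noscript|template)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, ' ') // drop non-content blocks
    .replace(/<!--[\s\S]*?-->/g, ' ') // drop comments
    .replace(/<[^>]*>/g, ' ')         // drop remaining tags
    .replace(/\s+/g, ' ')
    .trim();
  return text ? text.split(' ') : [];
}

// Hypothetical client-side-rendered shell: the content only exists after app.js runs.
const csrShell = `<!doctype html><html><head><title>Shop</title></head>
<body><div id="root"></div><script src="/app.js"></script></body></html>`;

// Hypothetical server-rendered page: the same content is already in the markup.
const ssrPage = `<!doctype html><html><head><title>Shop</title></head>
<body><main><h1>Trail Shoes</h1><p>Lightweight shoes for rocky terrain.</p></main></body></html>`;

console.log(visibleWords(csrShell).length); // 1 (only the <title> text)
console.log(visibleWords(ssrPage).length);  // 8
```

Both pages could look identical in a browser. Only the second one gives a non-rendering bot anything to read.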

Think I'm exaggerating? Here are the receipts.

I am not asking you to trust me. Open a terminal and run this against any site:

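# Status line the server returns to a GPTBot user agent (HEAD request):
curl -sI -A "GPTBot/1.0" https://example.com/ | head -1
# Rough count of the words the bot actually receives (tags stripped; inline script and style text is still counted):
curl -s -A "GPTBot/1.0" https://example.com/ | sed 's/<[^>]*>/ /g' | wc -w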

I ran exactly that against a stack of polished, well-funded sites in June 2026. Five of them never even let the crawler through the door — they answered GPTBot with an HTTP 403:

Site            What GPTBot got
Airbnb          403 (blocked)
Coinbase        403 (blocked)
Medium          403 (blocked)
Product Hunt    403 (blocked)
Udemy           403 (blocked)
Crunchbase      3,347 words ✅
Robinhood       1,677 words ✅
Linear          1,572 words ✅
Wikipedia       6,657 words ✅

Boom. Run the command yourself and you should see much the same thing, although bot rules change and results can vary by date and network. Now, some of those blocks are deliberate (plenty of publishers block AI crawlers on purpose, and that is a perfectly valid choice). The problem is that most teams have no idea which camp they are in. A Cloudflare bot rule or an over-zealous WAF can quietly 403 GPTBot, ClaudeBot and PerplexityBot while everyone congratulates themselves on the new design. You meant to win AI visibility; you accidentally bricked it.
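One rough way to find out which camp you are in is to compare what robots.txt declares with what the server actually does. The sketch below reads robots.txt in a deliberately simplified way (exact user-agent token or `*`, whole-site `Disallow: /` only), and `robotsBlocks` and `blockIntent` are hypothetical helper names, not part of any library. Treat the result as a hint: a 403 with no matching robots.txt rule can still be an intentional firewall policy.

```javascript
// Simplified robots.txt check: does the file disallow the whole site for this bot token?
// Assumptions: only "User-agent" and "Disallow: /" are considered, and a group naming the
// bot takes precedence over the "*" group. Path-level rules and Allow lines are ignored.
function robotsBlocks(robotsTxt, botToken) {
  const groups = []; // each group: { agents: [...], disallowAll: boolean }
  let current = null;
  let lastWasAgent = false;
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*/, '').trim(); // strip comments
    const m = line.match(/^([A-Za-z-]+)\s*:\s*(.*)$/);
    if (!m) continue;
    const field = m[1].toLowerCase();
    const value = m[2].trim();
    if (field === 'user-agent') {
      // Consecutive User-agent lines share one group; otherwise start a new one.
      if (!lastWasAgent) { current = { agents: [], disallowAll: false }; groups.push(current); }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else {
      if (field === 'disallow' && value === '/' && current) current.disallowAll = true;
      lastWasAgent = false;
    }
  }
  const token = botToken.toLowerCase();
  const specific = groups.filter(g => g.agents.includes(token));
  const chosen = specific.length ? specific : groups.filter(g => g.agents.includes('*'));
  return chosen.some(g => g.disallowAll);
}

// Combine the robots.txt answer with the HTTP status the bot's user agent received.
function blockIntent(robotsTxt, botToken, httpStatus) {
  const declared = robotsBlocks(robotsTxt, botToken);
  const refused = httpStatus === 401 || httpStatus === 403;
  if (declared) return 'blocked on purpose (robots.txt says so)';
  if (refused) return 'possibly accidental: robots.txt allows the bot, but the server refuses it';
  return 'allowed';
}

// Hypothetical robots.txt for illustration.
const sample = `User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin`;

console.log(blockIntent(sample, 'GPTBot', 403));    // blocked on purpose (robots.txt says so)
console.log(blockIntent(sample, 'ClaudeBot', 403)); // possibly accidental: robots.txt allows the bot, but the server refuses it
console.log(blockIntent(sample, 'ClaudeBot', 200)); // allowed
```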

What's actually happening

Expensive sites go invisible to AI in three ways, and you cannot see any of them from the front-end:

  1. The empty shell. A client-side-rendered build ships near-zero content in the raw HTML; everything is assembled by JavaScript the bot never runs.
  2. The missing money content. The shell renders fine, the word count looks healthy — but prices, reviews, specs, and anything behind a tab or a "load more" button are injected later. The bot sees your product page but not its price.
  3. The locked door. The server returns a 403 (or a CAPTCHA, or a JS challenge) to the crawler's user agent. Content is irrelevant when the bot never gets in.
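The three cases can be told apart from a handful of inputs: the HTTP status, the raw HTML the bot received, the word count you see after rendering, and a few strings the page must contain (a price, say). The function below is a sketch. The name `diagnose`, the 30% threshold and the input shape are my assumptions for illustration, not an established standard.

```javascript
// Classify a page into one of the three failure modes described above.
// Assumed input (hypothetical shape):
//   status        - HTTP status returned to the bot's user agent
//   rawHtml       - response body the bot received
//   renderedWords - word count you see in a browser after JavaScript runs
//   mustHave      - strings the page needs in order to be useful (price, SKU, ...)
function diagnose({ status, rawHtml, renderedWords, mustHave = [] }) {
  // 3. The locked door: the bot never got the content at all.
  if (status >= 400) return { verdict: 'locked door', detail: `HTTP ${status}` };

  const rawText = rawHtml
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, ' ')
    .replace(/<[^>]*>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
  const rawWords = rawText ? rawText.split(' ').length : 0;

  // 1. The empty shell. Assumption: under 30% of the rendered word count counts as a shell.
  const ratio = renderedWords ? rawWords / renderedWords : 0;
  if (ratio < 0.3) return { verdict: 'empty shell', detail: `${rawWords} of ${renderedWords} words` };

  // 2. The missing money content: healthy word count, but key strings are absent.
  const missing = mustHave.filter(s => !rawText.includes(s));
  if (missing.length) return { verdict: 'missing money content', detail: missing.join(', ') };

  return { verdict: 'ok', detail: `${rawWords} of ${renderedWords} words` };
}

console.log(diagnose({ status: 403, rawHtml: '', renderedWords: 500 }).verdict);
// locked door
console.log(diagnose({ status: 200, rawHtml: '<div id="root"></div>', renderedWords: 500 }).verdict);
// empty shell
console.log(diagnose({
  status: 200,
  rawHtml: '<h1>Trail Shoes</h1><p>Light and grippy.</p>',
  renderedWords: 6,
  mustHave: ['$129'],
}).verdict);
// missing money content
```

The order of the checks matters: a 403 page often has a few words of its own ("Access denied"), so the status has to be looked at before any word count.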

I've spent a lot of time in this exact mess

I will be honest: a good chunk of what I know here came from cleaning it up. Across several projects I have sat between development, content, and UX teams untangling the same gap — content the CMS swore was published, that looked perfect in the browser, that Search Console even showed rendered, and that the AI crawlers simply never received. It is almost never one person's fault. The dev team optimized for a slick client-side experience, the content team wrote great copy, the UX team made it beautiful — and nobody owned the question of what the raw HTML contained. That question is the whole game now.

Check your own site in 30 seconds

The bookmarklet (drag it to your bookmarks bar)

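New bookmark, paste this as the URL, click it on any of your pages. It compares the raw HTML a bot receives to what you see and tells you the percentage of content the bots actually get. It fetches with your browser's user agent, so it catches JavaScript-injected content but not bot-specific 403s; use the curl commands for those.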

javascript:(async()=>{try{const r=await fetch(location.href,{cache:'no-store',credentials:'omit'});const raw=await r.text();const d=new DOMParser().parseFromString(raw,'text/html');d.querySelectorAll('script,style,noscript,template').forEach(e=>e.remove());const rawT=(d.body?d.body.textContent:'').replace(/\s+/g,' ').trim();const renT=document.body.innerText.replace(/\s+/g,' ').trim();const rw=rawT?rawT.split(' ').length:0;const rn=renT?renT.split(' ').length:0;const p=rn?Math.round(100*rw/rn):0;alert('AI-CRAWLER VISIBILITY CHECK\n'+location.href+'\n\nWords in RAW HTML (what GPTBot/ClaudeBot/PerplexityBot get): '+rw+'\nWords after JS renders (what you see): '+rn+'\n\nNon-rendering AI bots see ~'+p+'% of your content.\n'+(p<70?'WARNING: a large share of your content is injected by JavaScript and is INVISIBLE to AI crawlers.':'OK: most of your content is in the raw HTML.'));}catch(e){alert('Could not fetch the raw HTML (cross-origin or CSP). Run it on your own site.\n'+e);}})();

The rest of the toolkit

  • View Source vs Inspect: Ctrl/Cmd+U (what the bot gets) vs DevTools → Elements (what you see). Content in one and not the other is your answer.
  • View Rendered Source — Raw / Rendered / Difference, line by line.
  • Quick Javascript Switcher — turn JS off, reload; blank page, blank bot.
  • Wappalyzer — identifies the framework so you know your risk.
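  • The curl receipts above: check each bot you care about (GPTBot, OAI-SearchBot, ClaudeBot, PerplexityBot). Google-Extended is a robots.txt control token rather than a crawler user agent, so audit it in robots.txt instead of with curl.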
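If you would rather check every bot in one pass than retype the curl command, something like the following could work in Node 18+ (which ships a built-in `fetch`). It is a sketch under assumptions: the short `<token>/1.0` user agents mirror the curl example rather than the crawlers' real, longer strings, and `checkBots`, `verdictFor` and `userAgentFor` are names I made up.

```javascript
// Tokens taken from the list above. Google-Extended is left out: it is a robots.txt
// control token, not a user agent that requests pages.
const BOTS = ['GPTBot', 'OAI-SearchBot', 'ClaudeBot', 'PerplexityBot'];

// Assumption: a short "<token>/1.0" user agent is enough to trigger most bot rules.
// Real crawlers send longer strings, and some firewalls also verify source IPs,
// so a 200 here does not guarantee the real bot gets a 200.
function userAgentFor(bot) {
  return `${bot}/1.0`;
}

// Turn a status code into a one-word verdict for the report.
function verdictFor(status) {
  if (status >= 200 && status < 300) return 'open';
  if (status === 401 || status === 403) return 'blocked';
  if (status === 429) return 'rate-limited';
  return 'check';
}

// Fetch the URL once per bot and report the status plus a rough raw word count.
// fetchImpl is injectable so the function can be exercised without a network.
async function checkBots(url, fetchImpl = fetch) {
  const rows = [];
  for (const bot of BOTS) {
    try {
      const res = await fetchImpl(url, {
        headers: { 'User-Agent': userAgentFor(bot) },
        redirect: 'follow',
      });
      const html = await res.text();
      const words = html.replace(/<[^>]*>/g, ' ').split(/\s+/).filter(Boolean).length;
      rows.push({ bot, status: res.status, verdict: verdictFor(res.status), words });
    } catch (err) {
      // Network failure, TLS error, timeout and similar.
      rows.push({ bot, status: null, verdict: 'error', words: 0 });
    }
  }
  return rows;
}

// Example:
// checkBots('https://example.com/').then(rows => console.table(rows));
```

A row that says "open" with a tiny word count is the empty shell; a row that says "blocked" is the locked door.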

How to fix it

  1. Render on the server — SSR (Next.js, Nuxt, SvelteKit) or static generation (SSG). Real HTML, immediately, for bots and humans.
  2. Put the money content in the HTML — prices, reviews, specs and key copy in the initial payload. Let JavaScript enhance, not create.
  3. Unlock the door on purpose. Audit robots.txt and your WAF/CDN rules so the assistants you want citing you are not silently 403'd. If you block AI crawlers, make it a decision, not an accident.
  4. Re-check with the curl commands and the bookmarklet until the raw HTML contains your content and the bots get a 200.
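For steps 1 and 2, here is a framework-free sketch of what "render on the server" means in practice. In a real project Next.js, Nuxt or SvelteKit would do this for you; the product record, the `/enhance.js` path and the `renderProductPage` and `startServer` names are hypothetical.

```javascript
const http = require('node:http');

// Hypothetical product record; in a real build this would come from your CMS or database.
const product = { name: 'Trail Shoes', price: '$129.00', rating: '4.6', reviews: 212 };

// Escape text before placing it in HTML.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Build the complete page on the server, so price and reviews are in the first response.
function renderProductPage(p) {
  return `<!doctype html>
<html lang="en">
<head><meta charset="utf-8"><title>${escapeHtml(p.name)}</title></head>
<body>
  <main>
    <h1>${escapeHtml(p.name)}</h1>
    <p class="price">${escapeHtml(p.price)}</p>
    <p class="rating">Rated ${escapeHtml(p.rating)} out of 5 from ${escapeHtml(p.reviews)} reviews</p>
  </main>
  <!-- JavaScript may enhance the page, but the content above does not depend on it. -->
  <script src="/enhance.js" defer></script>
</body>
</html>`;
}

// Serve the same HTML to every client, bot or browser.
function startServer(port) {
  return http
    .createServer((req, res) => {
      res.writeHead(200, { 'Content-Type': 'text/html; charset=utf-8' });
      res.end(renderProductPage(product));
    })
    .listen(port);
}

// startServer(3000);
// then: curl -s -A "GPTBot/1.0" http://localhost:3000/ | grep price
```

The point of the sketch is the division of labour: the server writes the name, price and rating into the markup, and the script tag is left to add behaviour on top.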

The takeaway

A site can look like a million dollars and hand a machine a blank page or a closed door — and the prettier the build, the easier it is to never notice. Looks are for humans. The forgotten HTML is for machines. Go read yours.

Want to know what AI crawlers actually see on your site?

I run advanced technical audits that check rendering, retrievability, and bot access — the things that decide whether you show up in search and AI answers at all. See how an advanced SEO audit works →
