HTML Document Over 15MB: Google's Crawl Limit and How to Fix It

November 27, 2023
Indexation, Page Size

Quick version: this flag means the raw HTML of a single URL weighs more than 15MB. Google fetches the first 15MB of an HTML file and throws the rest away, so anything past that byte cutoff is never parsed for that page. Real pages almost never get this fat on markup alone, so treat a genuine hit as a red flag that something enormous is baked directly into the document.

What this check flags

It measures the byte size of the HTML response itself for one URL. Not the images, not the CSS, not the JavaScript files the page pulls in afterward. Just the markup that comes back in the initial document request. When that single response crosses roughly 15MB, this check fires.

Google's own documentation is blunt about the behavior: Googlebot fetches up to the first 15MB of an HTML or text file, then stops reading that resource. Referenced assets are separate requests and count against their own budgets, not this one. So a page loading 40MB of hero video is fine here. A page whose markup is 18MB is not.

A real example

The classic offender is inline base64. A team building an email-template preview tool rendered every uploaded image straight into the page as a data:image/png;base64,... string instead of a real <img src>. One preview page carried 300 embedded images, each ballooned about 33% by base64 encoding. The HTML document came back at 21MB. Everything below roughly the 15MB mark got sliced off by Googlebot, including the footer navigation and the FAQ schema that sat at the bottom of the template.

The fix took an afternoon. They swapped the base64 blobs for real image files served from a normal URL:

<!-- before: markup carries the whole image -->
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA...(roughly 70KB of text)...">

<!-- after: markup carries a reference of under 100 bytes -->
<img src="/img/previews/template-042.png" width="600" height="400" alt="Template preview">

The document dropped from 21MB to about 90KB. The images still loaded — they were now separate requests that never touched the 15MB HTML ceiling — and the footer links and schema were back in the parsed document.
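The arithmetic behind that example is easy to sanity-check. Here is a rough sketch, assuming each preview image was about 52 KB before encoding (a hypothetical figure picked so the total lands near the 21MB above; the function name is made up for illustration):

```python
import base64
import math

def base64_len(raw_bytes: int) -> int:
    """Length of padded base64 text for a payload of raw_bytes bytes."""
    return 4 * math.ceil(raw_bytes / 3)

# Check the formula against the standard library on a small payload.
sample = bytes(1000)
assert base64_len(len(sample)) == len(base64.b64encode(sample))

# Hypothetical figures: 300 images of about 52 KB each before encoding.
raw_per_image = 52_000
images = 300
inline_total = images * base64_len(raw_per_image)

print(f"overhead per image: {base64_len(raw_per_image) / raw_per_image:.2f}x")
print(f"inline total: {inline_total / 1_000_000:.1f} MB")
```

Every 3 raw bytes become 4 characters of text, which is where the roughly 33% penalty comes from. Multiply that by a few hundred images and the markup alone clears the limit.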

What actually pushes an HTML document over 15MB

Source of bloat	Typical size added	Lives in the HTML document?
Inline base64 images (data URIs)	1–3MB each, adds up fast	Yes — counts fully
Server-rendered list with no pagination (10k+ rows)	2–10MB	Yes — counts fully
Giant inline JSON / state blob (SSR hydration)	1–8MB	Yes — counts fully
Inline SVG sprites pasted into the body	0.5–4MB	Yes — counts fully
Inline <style> and <script> that should be external	0.2–2MB	Yes — counts fully
Referenced images, video, fonts, external JS/CSS	any size	No — separate requests
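To see which of these sources is actually responsible on a given page, you can tally the inline bytes yourself. The following is a minimal sketch using Python's standard html.parser; the class name, function name, and the three buckets are assumptions for illustration, and it only covers data URIs in src/href plus inline script and style content (which includes SSR JSON blobs placed in a script element).

```python
from html.parser import HTMLParser

class BloatTally(HTMLParser):
    """Tally bytes for a few inline sources of HTML bloat."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.totals = {"data_uri": 0, "script": 0, "style": 0}
        self._inside = None  # "script" or "style" while inside that element

    def handle_starttag(self, tag, attrs):
        # Data URIs carried in src or href attributes count fully.
        for name, value in attrs:
            if name in ("src", "href") and value and value.startswith("data:"):
                self.totals["data_uri"] += len(value.encode("utf-8"))
        if tag in ("script", "style"):
            self._inside = tag

    def handle_endtag(self, tag):
        if tag == self._inside:
            self._inside = None

    def handle_data(self, data):
        # Only text inside inline <script> or <style> is counted here.
        if self._inside:
            self.totals[self._inside] += len(data.encode("utf-8"))

def tally_bloat(html: str) -> dict:
    parser = BloatTally()
    parser.feed(html)
    parser.close()
    return parser.totals

sample = (
    '<html><head><style>body{margin:0}</style></head><body>'
    '<img src="data:image/png;base64,AAAA">'
    '<script>var state = {};</script>'
    '<img src="/img/a.png"></body></html>'
)
print(tally_bloat(sample))
```

The referenced image at /img/a.png adds nothing to any bucket, which matches the last row of the table: it is a separate request.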

How to detect it on your own site

curl the raw document size. The fastest check. curl -s -o /dev/null -w "%{size_download} bytes\n" -A "Googlebot" https://example.com/suspect-page/ returns the byte size of the HTML response only. Anything over ~15,000,000 is a real problem. Under a megabyte and you can move on.
Screaming Frog. Crawl the site, then sort the Internal tab by the Size (bytes) column, which reports the HTML document size, not the total page weight. Every URL near or above 15MB rises to the top. In practice you are usually looking at a handful of templated pages, not the whole site.
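Chrome DevTools. Open the page, go to the Network tab, reload, and click the very first document request (the HTML). Filter to Doc to isolate it from the asset requests. The Size column shows the transferred size, which may be compressed; the uncompressed resource size is the number that matters here, and it appears when you hover over the Size cell or enable big request rows.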
Search Console URL Inspection. If you suspect truncation, use View crawled page and scroll the fetched HTML. If your closing tags, footer, or schema are missing from what Google fetched, the tail got cut.
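If you have more than a couple of suspect URLs, the curl check is easy to script. This is a sketch under a few assumptions: the limit is treated as 15,000,000 bytes to match the figure above, the 80% "near" threshold is an arbitrary choice, and both function names are hypothetical. urllib does not ask for compressed responses by default, so the count normally reflects the uncompressed body.

```python
import urllib.request

# Assumption: decimal megabytes, matching the ~15,000,000 figure used above.
LIMIT_BYTES = 15_000_000

def fetch_html_size(url: str, user_agent: str = "Googlebot") -> int:
    """Return the number of body bytes in the HTML response for url."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=30) as response:
        return len(response.read())

def classify(size_bytes: int, limit: int = LIMIT_BYTES, headroom: float = 0.8) -> str:
    """Label a document size as 'over', 'near', or 'ok' relative to the limit."""
    if size_bytes > limit:
        return "over"
    if size_bytes >= limit * headroom:
        return "near"
    return "ok"

# Usage (performs a real request, so it is left commented out):
# size = fetch_html_size("https://example.com/suspect-page/")
# print(size, classify(size))
```

A page that comes back "near" deserves the same attention as one that is "over", since a template change or a few more rows can tip it past the cut.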

How to fix it

Kill inline base64 images. This is the number-one cause. Move them to real files with <img src>. The image still loads; it just stops counting against the document.
Externalize inline scripts and styles. Move large inline <script> and <style> blocks into .js and .css files. Those become separate requests immediately.
Paginate or lazy-load runaway lists. A page dumping every product, comment, or table row into the initial HTML should paginate, or load additional rows on scroll. Nobody — user or crawler — benefits from 20,000 rows in one response.
Shrink SSR state blobs. Frameworks that inline the full app state for hydration can produce megabytes of JSON. Trim what you serialize to only what the page needs on first paint.
Re-fetch and confirm. curl the document again after each change. You want the number well under 15MB with headroom, not sitting at 14.8MB.
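For the runaway-list case, one way to think about pagination is in bytes rather than row counts. The sketch below groups already-rendered rows into pages that stay within a byte budget; the function name and the budget are illustrative assumptions, and a real site would more likely pick a fixed row count per page.

```python
def paginate_by_bytes(rows, budget_bytes):
    """Group rendered rows into pages whose UTF-8 size stays within budget_bytes.

    A single row larger than the budget still gets a page of its own.
    """
    pages, current, used = [], [], 0
    for row in rows:
        size = len(row.encode("utf-8"))
        # Start a new page when adding this row would exceed the budget.
        if current and used + size > budget_bytes:
            pages.append(current)
            current, used = [], 0
        current.append(row)
        used += size
    if current:
        pages.append(current)
    return pages

# Ten tiny rows of 19 bytes each, with a deliberately small budget for the demo.
rows = [f"<tr><td>{i}</td></tr>" for i in range(10)]
pages = paginate_by_bytes(rows, budget_bytes=40)
print(len(pages), [len(page) for page in pages])
```

The same idea scales up: set the budget to a small fraction of the limit and no single list page can get anywhere near it.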

Note that gzip does not save you here. Google measures the uncompressed document size, so a 20MB file that compresses to 2MB over the wire is still a 20MB document as far as the 15MB rule is concerned. Compression helps your users' bandwidth; it does nothing for this limit.
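A quick way to convince yourself: build an oversized document, compress it, and compare the two numbers. This is a sketch with a made-up table row repeated until the markup passes 20MB. Repetitive markup like this compresses far better than a real page would, so the exact ratio is not representative.

```python
import gzip

# Assumption: decimal megabytes, matching the ~15,000,000 figure used above.
LIMIT_BYTES = 15_000_000

# Hypothetical row, repeated until the document passes 20 MB.
row = b"<tr><td>SKU-0001</td><td>Example product row</td></tr>\n"
document = row * (20_000_000 // len(row) + 1)

# What travels over the wire when the server gzips the response.
wire = gzip.compress(document)

print(f"uncompressed: {len(document):,} bytes")
print(f"on the wire:  {len(wire):,} bytes")
print("over the limit:", len(gzip.decompress(wire)) > LIMIT_BYTES)
```

The wire size looks harmless, but the decompressed document is what gets measured, and it is still over.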

FAQ

Is 15MB actually a lot for an HTML file?

Enormous. Google notes the median HTML document is around 30KB — roughly 500 times smaller than the cap. A page hitting 15MB of pure markup is carrying something it should not, which is why this flag is worth investigating rather than dismissing.

Do images and video count against the 15MB?

No, as long as they are referenced with src rather than inlined as data URIs. Referenced media is fetched in its own request. Only bytes that arrive inside the HTML document itself count — which is exactly why inline base64 is so dangerous.

What happens to the content past 15MB?

Googlebot keeps the first 15MB and discards the rest of that response. Whatever sits beyond the cut — body copy, links, structured data — is never parsed for that URL. If your important content lives at the bottom of a bloated document, Google may simply never see it.
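You can simulate the cut locally to see whether a specific element would survive. The sketch below checks whether the first occurrence of a marker, for example a footer tag or a JSON-LD script tag, sits entirely inside the first 15,000,000 bytes. The function name is hypothetical, and it only checks where the marker itself falls, not whether everything after it also fits.

```python
# Assumption: decimal megabytes, matching the ~15,000,000 figure used above.
LIMIT_BYTES = 15_000_000

def survives_cut(document: bytes, marker: bytes, limit: int = LIMIT_BYTES) -> bool:
    """True if the first occurrence of marker ends within the first `limit` bytes."""
    position = document.find(marker)
    return position != -1 and position + len(marker) <= limit

# Small demo with a tiny limit so the effect is visible without a huge file.
doc = b"<html><body>" + b"x" * 100 + b"<footer>links</footer></body></html>"
print(survives_cut(doc, b"<footer>", limit=50))    # marker sits past the cut
print(survives_cut(doc, b"<footer>", limit=1000))  # marker sits before the cut
```

Run it against the raw bytes of a real response with markers for your schema block and footer navigation, and you know what a truncated fetch would keep.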

Could this be a false positive?

Occasionally. A crawler can misreport size if it captured a redirect chain or an error page instead of the real document, and a one-off timeout can distort a reading. Confirm with a direct curl before you spend engineering time. If curl shows a normal-sized document, the flag was noise.

Does this differ from a general "page is too heavy" warning?

Yes. This check is specifically the HTML document byte size against Google's 15MB fetch ceiling. A broader page-weight or index-bloat concern is about total transfer size and crawl efficiency across many URLs, which is a different problem with different fixes.

Related checks

If markup bloat is a symptom of a bigger crawl-efficiency issue, read crawl budget: what it is and when you should actually care. When large pages come from client-side frameworks, how search engines and AI crawlers render your pages explains why SSR state blobs get so big. And to confirm what Google actually fetched versus what you shipped, the Search Console page indexing report is where truncation shows up.

Claude Vincent

Claude Vincent is a technical SEO consultant focused on crawlability, rendering, and AI-search visibility. He writes the field guides and case studies at SEO ProCheck, with a bias toward the durable, unglamorous work that decides whether search engines and AI answer engines can actually read and cite a site.
