XML Sitemap Strategy Beyond One Big File: Index Sitemaps, Segmentation, and Using lastmod to Steer Crawling

November 12, 2024
Technical SEO

If your site ships a single sitemap.xml with every URL dumped into it, you're leaving diagnostic signal on the table. A sitemap isn't just a list you hand Google so it can find pages; it's a controllable input you can segment, measure, and use to read indexation health per content type. Done well, your sitemap structure becomes a dashboard that shows which template is failing to get indexed and lets you steer crawl attention toward the URLs that changed.

Why one big file hides your real problems

Search Console reports sitemap coverage at the sitemap level. If you submit a single file with 50,000 URLs and it shows "38,000 indexed," that number is useless for decision-making. You can't tell whether your blog is fully indexed while your programmatic landing pages are being ignored, or vice versa. The aggregate masks the variance, and variance is where the actionable insight lives.

The fix is to split your sitemaps along the same lines your site is built: by template or content type. When each template has its own sitemap, the submitted and indexed counts Search Console shows per sitemap (the Page indexing report, filtered to one sitemap) become a genuine indexation-rate breakdown. A template indexing at 95% is healthy. One indexing at 40% is telling you that those pages are thin, duplicative, orphaned, or otherwise judged not worth keeping. You find the problem in seconds instead of crawling and clustering 50,000 URLs by hand.

How to segment: by template, not by accident

Segment so each sitemap maps to a single content type with a coherent quality and freshness profile. For a typical site that means files like:

sitemap-products.xml: individual product/detail pages
sitemap-categories.xml: listing and faceted hub pages
sitemap-blog.xml: editorial articles
sitemap-pages.xml: static/marketing pages
sitemap-programmatic.xml: any templated, data-driven pages (locations, comparisons, etc.)

Then bind them together with a sitemap index file, an XML file whose entries are <sitemap> references rather than <url> entries. Submit only the index in Search Console and Bing Webmaster Tools; the engines fetch the children automatically.

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
    <lastmod>2026-06-03T09:14:00+00:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-blog.xml</loc>
    <lastmod>2026-06-01T17:42:00+00:00</lastmod>
  </sitemap>
</sitemapindex>
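
If your generator is written in Python, an index like the one above could be produced with the standard library alone. This is a minimal sketch, not a reference implementation: the function name build_sitemap_index and the shape of its input (pairs of URL and timezone-aware datetime) are assumptions made for illustration.

```python
from datetime import datetime, timezone
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(children):
    """Return sitemap index XML for an iterable of (loc, lastmod) pairs.

    `lastmod` is expected to be a timezone-aware datetime (an assumption
    of this sketch), so the output carries an explicit UTC offset.
    """
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc, lastmod in children:
        node = ET.SubElement(root, "sitemap")
        ET.SubElement(node, "loc").text = loc
        ET.SubElement(node, "lastmod").text = lastmod.isoformat(timespec="seconds")
    ET.indent(root, space="  ")  # requires Python 3.9+
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body + "\n"


index_xml = build_sitemap_index([
    ("https://example.com/sitemap-products.xml",
     datetime(2026, 6, 3, 9, 14, tzinfo=timezone.utc)),
    ("https://example.com/sitemap-blog.xml",
     datetime(2026, 6, 1, 17, 42, tzinfo=timezone.utc)),
])
```

Building the tree with an XML library rather than string concatenation means characters such as & in URLs are escaped for you.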

Respect the hard limits: each sitemap file holds a maximum of 50,000 URLs and 50MB uncompressed. When a template exceeds that, split it into numbered files (sitemap-products-1.xml, sitemap-products-2.xml) and list every part in the index. A single index file can reference up to 50,000 sitemaps. An index cannot list other index files, so if you genuinely operate beyond that scale, submit several separate index files instead of nesting them. Keeping each shard well under the cap also keeps the per-sitemap indexation numbers easy to reason about.
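
The splitting step can be sketched in a few lines. The helper name shard_urls and its naming scheme are assumptions for illustration, and the sketch only enforces the URL-count limit; a real generator would also need to check the 50MB size limit after serialising each file.

```python
MAX_URLS_PER_SITEMAP = 50_000  # URL-count limit from the sitemap protocol


def shard_urls(urls, template, max_urls=MAX_URLS_PER_SITEMAP):
    """Split one template's URLs into sitemap files of at most `max_urls`.

    Returns a dict mapping file name -> list of URLs. A template that
    fits in one file keeps the unnumbered name (sitemap-<template>.xml);
    larger ones become sitemap-<template>-1.xml, -2.xml, and so on.
    """
    urls = list(urls)
    if len(urls) <= max_urls:
        return {f"sitemap-{template}.xml": urls}
    shards = {}
    for start in range(0, len(urls), max_urls):
        part = start // max_urls + 1
        shards[f"sitemap-{template}-{part}.xml"] = urls[start:start + max_urls]
    return shards
```

Passing a lower max_urls (say 10,000) is one way to keep every shard well under the cap, at the cost of more rows to read in Search Console.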

Using lastmod to steer crawling

The <lastmod> element is the most misunderstood and most abused field in the spec. Used honestly, it's also your strongest lever for crawl prioritization. Google has stated that it uses lastmod as a signal for scheduling recrawls, but only when the values are consistently and verifiably accurate. The trust cuts both ways: if your lastmod values are unreliable, Google can disregard the field for your site altogether, and the lever is gone.

Rules for keeping lastmod a credible signal:

Only update it when the meaningful content actually changes. A new comment, a rotated "related products" widget, or a refreshed timestamp in the footer is not a content change. If you bump lastmod for cosmetic churn, you're crying wolf.
Use a complete, valid date format. That means W3C Datetime: either a date alone (2026-06-03) or the full timestamp (2026-06-03T09:14:00+00:00). Include the timezone when you include the time.
Never set every URL's lastmod to today's date on each build. This is the single most common way sites destroy the signal's value. If everything changed today, nothing did.
Propagate lastmod up to the index entry. The index lastmod should reflect the most recent change within that child sitemap, so engines can cheaply decide which children to refetch.
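
One way to enforce these rules in a generator is to fingerprint only the meaningful content and move lastmod when the fingerprint changes. The sketch below assumes you can extract the main content as a string (without comments, widgets, or footer text) and that you store the previous hash and lastmod per URL; all function names here are hypothetical.

```python
import hashlib
from datetime import datetime, timezone


def content_fingerprint(main_content: str) -> str:
    """Hash the meaningful content only, with whitespace normalised."""
    normalised = " ".join(main_content.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def next_lastmod(stored_hash, stored_lastmod, main_content, now):
    """Return (hash, lastmod); lastmod moves only on a real content change."""
    new_hash = content_fingerprint(main_content)
    if new_hash == stored_hash and stored_lastmod is not None:
        return stored_hash, stored_lastmod  # cosmetic churn: keep the old date
    return new_hash, now


def index_lastmod(url_lastmods):
    """lastmod for an index entry: the newest lastmod in that child sitemap."""
    return max(url_lastmods).isoformat(timespec="seconds")


first_build = datetime(2026, 6, 1, 17, 42, tzinfo=timezone.utc)
later_build = datetime(2026, 6, 3, 9, 14, tzinfo=timezone.utc)

h, lm = next_lastmod(None, None, "Blue widget, 4 cm", first_build)
# Whitespace-only difference: hash matches, lastmod stays at first_build.
h_same, lm_same = next_lastmod(h, lm, "Blue   widget,\n4 cm", later_build)
# Real edit: hash changes, lastmod moves to later_build.
h_new, lm_new = next_lastmod(h, lm, "Blue widget, 5 cm", later_build)
```

Because a rebuild with unchanged content returns the stored date, this approach also rules out the "everything changed today" failure by construction.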

When your lastmod is honest, you effectively get to nominate which URLs deserve a fresh crawl. Updated a thousand product descriptions overnight? Their lastmod moves, the engine sees a concentrated freshness signal in sitemap-products.xml, and recrawl scheduling can respond. It remains a hint rather than a command, but it is a hint you control directly, and it is far more reliable than hoping the crawler stumbles back on its own.

Note that <changefreq> and <priority> are effectively ignored by Google today. Don't spend engineering time computing them; invest that effort in getting lastmod right.

Turning segmentation into a diagnostic workflow

Once segmented, run this loop:

Read per-sitemap indexation rates in Search Console. Rank your templates by indexed ÷ submitted.
Triage the laggards. A low rate on a template almost always points to a systemic cause: near-duplicate content, soft 404s, thin pages, missing internal links, or canonicalization sending equity elsewhere. Because the URLs share a template, the fix is usually one code change applied to thousands of pages.
Isolate experiments. Spin a suspect cohort into its own sitemap to watch indexation move independently after a change, instead of drowning it in the aggregate.
Cross-check with the URL Inspection API or your crawler to confirm whether laggards are "Discovered - currently not indexed," "Crawled - currently not indexed," or flagged as duplicates, because each points to a different remedy.
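
The ranking step in this loop is simple arithmetic once you have the counts. The sketch below assumes you have already collected submitted and indexed counts per sitemap (by hand or by export); it does not call any Search Console API, and the numbers in the usage example are made up for illustration.

```python
def rank_by_indexation_rate(counts):
    """Sort sitemaps from lowest to highest indexed / submitted ratio.

    `counts` maps sitemap name -> (submitted, indexed).
    Sitemaps with zero submitted URLs are skipped.
    """
    rows = []
    for name, (submitted, indexed) in counts.items():
        if submitted == 0:
            continue  # nothing submitted, so there is no rate to compute
        rows.append((name, indexed / submitted))
    return sorted(rows, key=lambda row: row[1])


# Made-up counts, for illustration only.
example_counts = {
    "sitemap-blog.xml": (2_000, 1_900),
    "sitemap-programmatic.xml": (30_000, 12_000),
    "sitemap-pages.xml": (0, 0),
}
ranked = rank_by_indexation_rate(example_counts)
for name, rate in ranked:
    print(f"{name}: {rate:.0%}")
```

Sorting lowest first puts the templates that need triage at the top of the list.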

A useful advanced move: keep a deliberately separate sitemap for pages you suspect are weak (a newly launched programmatic set, for instance). Watching that cohort's indexation curve in isolation tells you whether the new content is earning its place before it dilutes your sitewide signal.

Common mistakes

Listing non-indexable URLs. Sitemaps should contain only canonical, 200-status, indexable URLs. Including redirects, noindex pages, parameter variants, or non-canonical URLs pollutes your indexation math and erodes trust in the file.
Mismatched canonicals. Every URL in a sitemap should be the canonical version. If the sitemap URL and the page's rel=canonical disagree, you're sending conflicting signals.
Submitting children directly when you have an index. Submit the index file only; let the engine discover the rest. Double-submitting creates redundant, confusing reporting.
Stale lastmod after real edits. This is the inverse of crying wolf: if you genuinely overhaul a page but never bump lastmod, you've waived your recrawl request.
Forgetting the robots reference. Add Sitemap: https://example.com/sitemap-index.xml to robots.txt so the index is discoverable independent of any single search tool.
One giant file at scale. Beyond hiding diagnostics, it's slower to generate, fetch, and parse, and a single malformed entry can invalidate the whole file.
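
Several of these mistakes can be caught with a guard in the generator before a URL is ever written to a file. The sketch below assumes your CMS or crawler already gives you each page's status code, noindex flag, and rel=canonical value; the function names and parameters are hypothetical, and it does not fetch anything itself.

```python
def sitemap_eligible(url, status, noindex, canonical=None):
    """True only for 200-status, indexable, self-canonical URLs.

    `canonical` is the page's rel=canonical target, or None if the
    page declares no canonical at all.
    """
    if status != 200 or noindex:
        return False
    return canonical is None or canonical == url


def robots_sitemap_line(index_url):
    """The robots.txt line that advertises the sitemap index."""
    return f"Sitemap: {index_url}"


# Hypothetical page records, for illustration only.
pages = [
    ("https://example.com/a", 200, False, "https://example.com/a"),
    ("https://example.com/a?sort=price", 200, False, "https://example.com/a"),
    ("https://example.com/old", 301, False, None),
    ("https://example.com/private", 200, True, None),
]
kept = [url for url, status, noindex, canonical in pages
        if sitemap_eligible(url, status, noindex, canonical)]
```

Here the parameter variant is dropped because its canonical points elsewhere, which is usually the cheapest way to keep such variants out of the file.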

The payoff

Segmented sitemaps plus an honest lastmod convert a passive discovery file into an active control surface. You get a per-template indexation dashboard for free, a way to localize systemic quality problems to a single code path, and a legitimate channel to request recrawls on the URLs that actually changed. The work is mostly in your sitemap generator, one-time engineering, and it pays back every time you need to answer "which pages aren't getting indexed, and why."

Claude Vincent

Claude Vincent is a technical SEO consultant focused on crawlability, rendering, and AI-search visibility. He writes the field guides and case studies at SEO ProCheck, with a bias toward the durable, unglamorous work that decides whether search engines and AI answer engines can actually read and cite a site.
