XML Sitemap Strategy Beyond One Big File: Index Sitemaps, Segmentation, and Using lastmod to Steer Crawling


If your site ships a single sitemap.xml with every URL dumped into it, you're leaving diagnostic signal on the table. A sitemap isn't just a list you hand Google so it can find pages — it's a controllable input you can segment, measure, and use to read indexation health per content type. Done well, your sitemap structure becomes a dashboard that tells you exactly which template is failing to get indexed and lets you steer crawl attention toward the URLs that changed.

Why one big file hides your real problems

Search Console reports sitemap coverage at the sitemap level. If you submit a single file with 50,000 URLs and it shows "38,000 indexed," that number is useless for decision-making. You can't tell whether your blog is fully indexed while your programmatic landing pages are being ignored, or vice versa. The aggregate masks the variance — and variance is where the actionable insight lives.

The fix is to split your sitemaps along the same lines your site is built: by template or content type. When each template has its own sitemap, you can filter Search Console's Page indexing report by sitemap and compare indexed URLs against submitted URLs, which gives you a genuine indexation-rate breakdown. A template indexing at 95% is healthy. One indexing at 40% is telling you that those pages are thin, duplicative, orphaned, or otherwise judged not worth keeping. You find the problem in seconds instead of crawling and clustering 50,000 URLs by hand.

How to segment: by template, not by accident

Segment so each sitemap maps to a single content type with a coherent quality and freshness profile. For a typical site that means files like:

  • sitemap-products.xml — individual product/detail pages
  • sitemap-categories.xml — listing and faceted hub pages
  • sitemap-blog.xml — editorial articles
  • sitemap-pages.xml — static/marketing pages
  • sitemap-programmatic.xml — any templated, data-driven pages (locations, comparisons, etc.)

Then bind them together with a sitemap index file — an XML file whose entries are <sitemap> references rather than <url> entries. Submit only the index in Search Console and Bing Webmaster Tools; the engines fetch the children automatically.

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
    <lastmod>2026-06-03T09:14:00+00:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-blog.xml</loc>
    <lastmod>2026-06-01T17:42:00+00:00</lastmod>
  </sitemap>
</sitemapindex>

Respect the hard limits: each sitemap file holds a maximum of 50,000 URLs and 50MB uncompressed. When a template exceeds that, paginate it (sitemap-products-1.xml, sitemap-products-2.xml) and list all parts in the index. A single index file can reference up to 50,000 sitemaps. Index files cannot be nested, because an index may list only sitemaps and not other indexes, so if you genuinely operate at that scale, submit multiple index files instead. Keeping each shard well under the cap also keeps the per-sitemap indexation numbers easy to reason about.
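
As a rough illustration of the pagination described above, here is a minimal Python sketch that shards one template's URLs and emits a matching index. The helper names (shard, build_index), the base URL, and the sitemap-<template>-<n>.xml naming are assumptions made for this example, and the sketch checks only the URL count, not the 50MB size limit.

```python
from datetime import datetime, timezone
from xml.etree import ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # per-file URL cap from the sitemap protocol


def shard(entries, size=MAX_URLS):
    """Split a list of (url, lastmod) pairs into chunks of at most `size`."""
    return [entries[i:i + size] for i in range(0, len(entries), size)]


def build_index(base_url, template, shards):
    """Return sitemap index XML (as str) listing one child sitemap per shard."""
    root = ET.Element("sitemapindex", xmlns=NS)
    for n, chunk in enumerate(shards, start=1):
        node = ET.SubElement(root, "sitemap")
        # Hypothetical naming scheme: sitemap-<template>-<n>.xml
        ET.SubElement(node, "loc").text = f"{base_url}/sitemap-{template}-{n}.xml"
        # Index lastmod is the most recent change inside that child sitemap.
        newest = max(lastmod for _, lastmod in chunk)
        ET.SubElement(node, "lastmod").text = newest.isoformat(timespec="seconds")
    return ET.tostring(root, encoding="unicode")
```

Passing timezone-aware datetimes keeps the emitted lastmod in the full W3C Datetime form with an offset; a real generator would also write each shard to disk and verify its byte size.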

Using lastmod to steer crawling

The <lastmod> element is the most misunderstood and most abused field in the spec, and when used honestly, it's your strongest lever for crawl prioritization. Google has stated it uses lastmod as a signal for scheduling recrawls, but only when the values are consistently and verifiably accurate. The catch is reciprocal: if your lastmod values prove untrustworthy, Google can stop trusting the field for your site, and you lose the lever.

Rules for keeping lastmod a credible signal:

  • Only update it when the meaningful content actually changes. A new comment, a rotated "related products" widget, or a refreshed timestamp in the footer is not a content change. If you bump lastmod for cosmetic churn, you're crying wolf.
  • Use a complete, valid date format. W3C Datetime — either 2026-06-03 or the full 2026-06-03T09:14:00+00:00. Include the timezone when you include the time.
  • Never set every URL's lastmod to today's date on each build. This is the single most common way sites destroy the signal's value. If everything changed today, nothing did.
  • Propagate lastmod up to the index entry. The index lastmod should reflect the most recent change within that child sitemap, so engines can cheaply decide which children to refetch.
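
One way to enforce these rules in a sitemap generator is to derive lastmod from a fingerprint of the meaningful content rather than from build time. The sketch below assumes each page stores its previous hash and lastmod, and that title and body are the only fields that count as meaningful; the function names and that field choice are illustrative, not a prescribed method.

```python
import hashlib
from datetime import datetime, timezone


def content_fingerprint(title, body):
    """Hash only the fields treated as meaningful content (an assumption)."""
    return hashlib.sha256(f"{title}\n{body}".encode("utf-8")).hexdigest()


def next_lastmod(stored_hash, stored_lastmod, title, body, now=None):
    """Return (hash, lastmod); lastmod moves only if the fingerprint changed."""
    new_hash = content_fingerprint(title, body)
    if new_hash == stored_hash:
        # Cosmetic churn (widgets, footer timestamps) never reaches the hash,
        # so the previous lastmod is kept as-is.
        return stored_hash, stored_lastmod
    return new_hash, now or datetime.now(timezone.utc)


def index_lastmod(child_lastmods):
    """Index entry lastmod: newest child lastmod, in W3C Datetime format."""
    return max(child_lastmods).isoformat(timespec="seconds")
```

Because comments, related-product widgets, and footer timestamps are never fed into the fingerprint, rebuilding the site leaves lastmod untouched unless the title or body actually changed.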

When your lastmod is honest, you effectively get to nominate which URLs deserve a fresh crawl. Updated a thousand product descriptions overnight? Their lastmod moves, the engine sees a concentrated freshness signal in sitemap-products.xml, and recrawl scheduling can respond. It remains a hint rather than a command, but it is a hint you control directly, and far more reliable than hoping the crawler stumbles back on its own.

Note that <changefreq> and <priority> are effectively ignored by Google today. Don't spend engineering time computing them; invest that effort in getting lastmod right.

Turning segmentation into a diagnostic workflow

Once segmented, run this loop:

  1. Read per-sitemap indexation rates in Search Console. Rank your templates by indexed ÷ submitted.
  2. Triage the laggards. A low rate on a template almost always points to a systemic cause: near-duplicate content, soft 404s, thin pages, missing internal links, or canonicalization sending equity elsewhere. Because the URLs share a template, the fix is usually one code change applied to thousands of pages.
  3. Isolate experiments. Spin a suspect cohort into its own sitemap to watch indexation move independently after a change, instead of drowning it in the aggregate.
  4. Cross-check with the URL Inspection API or your crawler to confirm whether laggards are "Discovered – currently not indexed," "Crawled – currently not indexed," or one of the "Duplicate" statuses, because each points to a different remedy.
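
Step 1 of the loop above is simple arithmetic, and a small script keeps it repeatable. The sketch below assumes you have copied submitted and indexed counts per sitemap out of Search Console by hand; the counts shown are made up purely for illustration.

```python
def indexation_rates(counts):
    """counts: {sitemap_name: (submitted, indexed)} -> list sorted worst-first."""
    rates = []
    for name, (submitted, indexed) in counts.items():
        # Guard against empty sitemaps to avoid dividing by zero.
        rate = indexed / submitted if submitted else 0.0
        rates.append((name, rate))
    return sorted(rates, key=lambda pair: pair[1])


# Illustrative, made-up counts; real values come from Search Console.
example = {
    "sitemap-blog.xml": (2_000, 1_900),
    "sitemap-programmatic.xml": (30_000, 12_000),
    "sitemap-products.xml": (18_000, 16_200),
}

for name, rate in indexation_rates(example):
    print(f"{name}: {rate:.0%}")
```

Sorting worst-first puts the template that most needs triage at the top of the output.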

A useful advanced move: keep a deliberately separate sitemap for pages you suspect are weak — newly launched programmatic sets, for instance. Watching that cohort's indexation curve in isolation tells you whether the new content is earning its place before it dilutes your sitewide signal.

Common mistakes

  • Listing non-indexable URLs. Sitemaps should contain only canonical, 200-status, indexable URLs. Including redirects, noindex pages, parameter variants, or non-canonical URLs pollutes your indexation math and erodes trust in the file.
  • Mismatched canonicals. Every URL in a sitemap should be the canonical version. If the sitemap URL and the page's rel=canonical disagree, you're sending conflicting signals.
  • Submitting children directly when you have an index. Submit the index file only; let the engine discover the rest. Double-submitting creates redundant, confusing reporting.
  • Stale lastmod after real edits. The inverse of crying wolf — if you genuinely overhaul a page but never bump lastmod, you've waived your recrawl request.
  • Forgetting the robots reference. Add Sitemap: https://example.com/sitemap-index.xml to robots.txt so the index is discoverable independent of any single search tool.
  • One giant file at scale. Beyond hiding diagnostics, it's slower to generate, fetch, and parse, and a single malformed entry can invalidate the whole file.
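
The first two mistakes in this list can be caught with a filter in the generator before anything is written. The sketch below assumes you already have crawl data per URL; the Page record and its fields (status, canonical, noindex) are hypothetical names for this example, not any particular crawler's schema.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    status: int      # HTTP status observed when fetching the URL
    canonical: str   # value of the page's rel=canonical
    noindex: bool    # True if a noindex directive is present


def sitemap_eligible(page):
    """True only for 200-status, indexable, self-canonical URLs."""
    return page.status == 200 and not page.noindex and page.canonical == page.url


def eligible_urls(pages):
    """Return the URLs that are safe to list in a sitemap."""
    return [p.url for p in pages if sitemap_eligible(p)]
```

Redirects fail the status check, and parameter variants drop out when their canonical points at the clean URL, so neither reaches the file.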

The payoff

Segmented sitemaps plus an honest lastmod convert a passive discovery file into an active control surface. You get a per-template indexation dashboard for free, a way to localize systemic quality problems to a single code path, and a legitimate channel to request recrawls on the URLs that actually changed. The work is mostly in your sitemap generator — one-time engineering — and it pays back every time you need to answer "which pages aren't getting indexed, and why."
