XML Sitemap Strategy Beyond One Big File: Index Sitemaps, Segmentation, and Using lastmod to Steer Crawling
- November 12, 2024
- Technical SEO
If your site ships a single sitemap.xml with every URL dumped into it, you're leaving diagnostic signal on the table. A sitemap isn't just a list you hand Google so it can find pages — it's a controllable input you can segment, measure, and use to read indexation health per content type. Done well, your sitemap structure becomes a dashboard that tells you exactly which template is failing to get indexed and lets you steer crawl attention toward the URLs that changed.
Why one big file hides your real problems
Search Console reports sitemap coverage at the sitemap level. If you submit a single file with 50,000 URLs and it shows "38,000 indexed," that number is useless for decision-making. You can't tell whether your blog is fully indexed while your programmatic landing pages are being ignored, or vice versa. The aggregate masks the variance — and variance is where the actionable insight lives.
The fix is to split your sitemaps along the same lines your site is built: by template or content type. When each template has its own sitemap, Search Console's per-sitemap "Submitted vs. Indexed" counts become a genuine indexation-rate breakdown. A template indexing at 95% is healthy. One indexing at 40% is telling you that those pages are thin, duplicative, orphaned, or otherwise judged not worth keeping. You found the problem in seconds instead of crawling and clustering 50,000 URLs by hand.
How to segment: by template, not by accident
Segment so each sitemap maps to a single content type with a coherent quality and freshness profile. For a typical site that means files like:
- sitemap-products.xml: individual product/detail pages
- sitemap-categories.xml: listing and faceted hub pages
- sitemap-blog.xml: editorial articles
- sitemap-pages.xml: static/marketing pages
- sitemap-programmatic.xml: any templated, data-driven pages (locations, comparisons, etc.)
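As a rough illustration of segmenting by template, the sketch below assigns each URL to a sitemap file by path prefix. The prefixes in TEMPLATE_RULES are hypothetical and only stand in for whatever URL structure your site actually uses; a real generator would more likely read the template from your CMS or routing layer.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical path-prefix rules; replace with your own URL structure.
TEMPLATE_RULES = [
    ("/products/", "sitemap-products.xml"),
    ("/category/", "sitemap-categories.xml"),
    ("/blog/", "sitemap-blog.xml"),
    ("/locations/", "sitemap-programmatic.xml"),
]
DEFAULT_SITEMAP = "sitemap-pages.xml"


def sitemap_for(url: str) -> str:
    """Return the sitemap file a URL belongs to, based on its path prefix."""
    path = urlparse(url).path
    for prefix, filename in TEMPLATE_RULES:
        if path.startswith(prefix):
            return filename
    return DEFAULT_SITEMAP


def segment(urls):
    """Group URLs into {sitemap filename: [urls]}, preserving input order."""
    groups = defaultdict(list)
    for url in urls:
        groups[sitemap_for(url)].append(url)
    return dict(groups)
```

Anything that matches no rule falls into sitemap-pages.xml, so it is worth reviewing that file occasionally for URLs that belong to a template you have not yet named.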
Then bind them together with a sitemap index file — an XML file whose entries are <sitemap> references rather than <url> entries. Submit only the index in Search Console and Bing Webmaster Tools; the engines fetch the children automatically.
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
    <lastmod>2026-06-03T09:14:00+00:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-blog.xml</loc>
    <lastmod>2026-06-01T17:42:00+00:00</lastmod>
  </sitemap>
</sitemapindex>
Respect the hard limits: each sitemap file holds a maximum of 50,000 URLs and 50MB uncompressed. When a template exceeds that, paginate it (sitemap-products-1.xml, sitemap-products-2.xml) and list all parts in the index. A single index file can reference up to 50,000 sitemaps. Index files cannot be nested: an index may list only sitemap files, not other indexes, so if you genuinely operate beyond that scale, submit multiple index files instead. Keeping each shard well under the cap also keeps the per-sitemap indexation numbers easy to reason about.
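A minimal sketch of sharding a template and emitting the index, using only Python's standard library. The function names and the (loc, lastmod) tuple shape are assumptions made for illustration. The sketch enforces only the 50,000-URL limit; checking the 50MB uncompressed size is left out.

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # protocol limit per sitemap file


def shard(entries, size=MAX_URLS):
    """Split a list of (loc, lastmod) tuples into chunks of at most `size`."""
    return [entries[i:i + size] for i in range(0, len(entries), size)]


def build_urlset(entries) -> bytes:
    """Serialize one child sitemap from (loc, lastmod) tuples."""
    root = ET.Element("urlset", xmlns=NS)
    for loc, lastmod in entries:
        url = ET.SubElement(root, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


def build_index(children) -> bytes:
    """Serialize a sitemap index from (sitemap URL, lastmod) tuples."""
    root = ET.Element("sitemapindex", xmlns=NS)
    for loc, lastmod in children:
        sitemap = ET.SubElement(root, "sitemap")
        ET.SubElement(sitemap, "loc").text = loc
        ET.SubElement(sitemap, "lastmod").text = lastmod
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)
```

In practice you would pass a smaller `size` than the hard cap, which leaves headroom for growth and keeps each shard's numbers easier to read.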
Using lastmod to steer crawling
The <lastmod> element is the most misunderstood and most abused field in the spec, and when used honestly, it's your strongest lever for crawl prioritization. Google has stated it uses lastmod as a signal for scheduling recrawls, but only when the values are consistently and verifiably accurate. The catch is reciprocal: if your lastmod values are untrustworthy, Google can stop trusting the field for your site, and you lose the lever.
Rules for keeping lastmod a credible signal:
- Only update it when the meaningful content actually changes. A new comment, a rotated "related products" widget, or a refreshed timestamp in the footer is not a content change. If you bump lastmod for cosmetic churn, you're crying wolf.
- Use a complete, valid date format. W3C Datetime: either 2026-06-03 or the full 2026-06-03T09:14:00+00:00. Include the timezone when you include the time.
- Never set every URL's lastmod to today's date on each build. This is the single most common way sites destroy the signal's value. If everything changed today, nothing did.
- Propagate lastmod up to the index entry. The index lastmod should reflect the most recent change within that child sitemap, so engines can cheaply decide which children to refetch.
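One way to keep these rules honest is to derive lastmod from a fingerprint of the meaningful content rather than from build time. The sketch below is an assumption about how a generator might do that: it hashes only a title and body (hypothetical field names), moves lastmod only when that hash changes, and computes the index entry as the latest child value. It assumes every stored lastmod uses the full form with a timezone offset.

```python
import hashlib
from datetime import datetime, timezone


def content_fingerprint(title: str, body: str) -> str:
    """Hash only the fields that count as meaningful content."""
    return hashlib.sha256(f"{title}\n{body}".encode("utf-8")).hexdigest()


def next_lastmod(stored_hash, stored_lastmod, title, body, now=None):
    """Return (hash, lastmod); lastmod moves only when the fingerprint changes."""
    new_hash = content_fingerprint(title, body)
    if new_hash == stored_hash:
        return stored_hash, stored_lastmod  # cosmetic rebuild: keep the old date
    now = now or datetime.now(timezone.utc)
    return new_hash, now.isoformat(timespec="seconds")


def index_lastmod(child_lastmods):
    """The index entry carries the most recent lastmod among a child's URLs."""
    return max(child_lastmods, key=datetime.fromisoformat)
```

Because widgets, comments, and footer timestamps never enter the fingerprint, rebuilding the site every day leaves lastmod untouched unless the title or body actually changed.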
When your lastmod is honest, you effectively get to nominate which URLs deserve a fresh crawl. Updated a thousand product descriptions overnight? Their lastmod moves, the engine sees a concentrated freshness signal in sitemap-products.xml, and recrawl scheduling responds. That's crawl steering you control directly — far more reliable than hoping the crawler stumbles back on its own.
Note that <changefreq> and <priority> are effectively ignored by Google today. Don't spend engineering time computing them; invest that effort in getting lastmod right.
Turning segmentation into a diagnostic workflow
Once segmented, run this loop:
- Read per-sitemap indexation rates in Search Console. Rank your templates by indexed ÷ submitted.
- Triage the laggards. A low rate on a template almost always points to a systemic cause: near-duplicate content, soft 404s, thin pages, missing internal links, or canonicalization sending equity elsewhere. Because the URLs share a template, the fix is usually one code change applied to thousands of pages.
- Isolate experiments. Spin a suspect cohort into its own sitemap to watch indexation move independently after a change, instead of drowning it in the aggregate.
- Cross-check with the URL Inspection API or your crawler to confirm whether laggards are "Discovered – currently not indexed," "Crawled – currently not indexed," or one of the "Duplicate" statuses, because each points to a different remedy.
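The first step of the loop is simple arithmetic, sketched below. The counts are illustrative numbers only, not real data, and the sketch assumes you have copied submitted and indexed totals per sitemap out of Search Console by hand; it makes no API calls.

```python
def indexation_rates(counts):
    """counts: {sitemap: (submitted, indexed)} -> [(sitemap, rate)], worst first."""
    rates = [
        (name, indexed / submitted)
        for name, (submitted, indexed) in counts.items()
        if submitted > 0  # skip empty sitemaps to avoid dividing by zero
    ]
    return sorted(rates, key=lambda item: item[1])


# Illustrative numbers only.
example = {
    "sitemap-blog.xml": (2_000, 1_900),
    "sitemap-programmatic.xml": (30_000, 12_000),
    "sitemap-products.xml": (18_000, 16_200),
}

for name, rate in indexation_rates(example):
    print(f"{name}: {rate:.0%}")
```

With these made-up inputs the programmatic template sorts to the top at 40%, which is the one to triage first.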
A useful advanced move: keep a deliberately separate sitemap for pages you suspect are weak — newly launched programmatic sets, for instance. Watching that cohort's indexation curve in isolation tells you whether the new content is earning its place before it dilutes your sitewide signal.
Common mistakes
- Listing non-indexable URLs. Sitemaps should contain only canonical, 200-status, indexable URLs. Including redirects, noindex pages, parameter variants, or non-canonical URLs pollutes your indexation math and erodes trust in the file.
- Mismatched canonicals. Every URL in a sitemap should be the canonical version. If the sitemap URL and the page's rel=canonical disagree, you're sending conflicting signals.
- Submitting children directly when you have an index. Submit the index file only; let the engine discover the rest. Double-submitting creates redundant, confusing reporting.
- Stale lastmod after real edits. The inverse of crying wolf: if you genuinely overhaul a page but never bump lastmod, you've waived your recrawl request.
- Forgetting the robots reference. Add Sitemap: https://example.com/sitemap-index.xml to robots.txt so the index is discoverable independent of any single search tool.
- One giant file at scale. Beyond hiding diagnostics, it's slower to generate, fetch, and parse, and a single malformed entry can invalidate the whole file.
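The first two mistakes can be caught with a filter in the generator. The sketch below assumes you already have crawl data for each page; the Page record and its fields are hypothetical and would come from your own crawler or CMS.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    status: int      # HTTP status returned for the URL
    noindex: bool    # True if a robots meta tag or header says noindex
    canonical: str   # value of the page's rel=canonical


def is_sitemap_eligible(page: Page) -> bool:
    """Keep only 200-status, indexable URLs that are their own canonical."""
    return (
        page.status == 200
        and not page.noindex
        and page.canonical == page.url
    )


def eligible_urls(pages):
    """Return the URLs that are safe to list in a sitemap."""
    return [page.url for page in pages if is_sitemap_eligible(page)]
```

The canonical comparison here is an exact string match, so normalize trailing slashes and protocol consistently before relying on it.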
The payoff
Segmented sitemaps plus an honest lastmod convert a passive discovery file into an active control surface. You get a per-template indexation dashboard for free, a way to localize systemic quality problems to a single code path, and a legitimate channel to request recrawls on the URLs that actually changed. The work is mostly in your sitemap generator — one-time engineering — and it pays back every time you need to answer "which pages aren't getting indexed, and why."








