TL;DR
An XML sitemap is a machine-readable list of the URLs you want search engines to find. It is a discovery aid, not an indexing guarantee. Include only canonical, indexable, status-200 URLs. Never list noindexed, redirected, blocked, or 404 pages. Use an accurate lastmod and skip changefreq and priority entirely, because Google ignores both. Stay under the official limit of 50,000 URLs and 50MB uncompressed per file, and use a sitemap index to group multiple files. Submit through Search Console and your robots.txt. The old ping endpoint is gone, so do not rely on it. In the AI era the same clean, honest sitemap helps every crawler that respects the standard.
Sitemaps are one of the oldest pieces of SEO infrastructure, and also one of the most misunderstood. People treat them as a magic indexing button, stuff them with every URL a CMS can generate, and then wonder why nothing improves. This reference explains what an XML sitemap actually does, what belongs in it, and how to keep it honest in 2026, when traditional search crawlers and AI crawlers are both reading your files.
What an XML sitemap is and what it does
An XML sitemap is a structured file that lists URLs on your site, optionally with metadata such as when each URL was last modified. It uses the open Sitemaps protocol, originally published at sitemaps.org and supported by Google, Bing, and other major engines.
The single most important thing to understand: a sitemap is a discovery aid, not an indexing guarantee. It tells a crawler "these URLs exist and here is when they last changed." It does not force anything into the index, and it does not raise rankings. Google's guidance is to include in a sitemap the URLs you want to appear in search results, and it states plainly that listing a URL does not guarantee crawling or indexing.
So why bother? Because discovery still matters. Large sites, sites with weak internal linking, new sites with few backlinks, and sites with content that is not well connected all benefit when a crawler has a clean manifest of canonical URLs to work from. A sitemap also gives you a feedback loop in Search Console: submitted versus discovered counts that reveal crawling and indexing problems early.
What to include, and what to exclude
The rule is short: include only URLs you genuinely want to appear in search, and exclude everything else. In practice every URL in your sitemap should pass four tests.
- Canonical. List the canonical version of each page, not parameter variants, session URLs, or duplicate paths. The URL in the sitemap should match the URL in the page's rel=canonical tag.
- Indexable. The page must not carry a noindex directive. A noindexed URL in a sitemap sends a contradictory signal: "find this" plus "do not index this." Search Console flags this as an error.
- Status 200. The URL must return a live 200 response. Redirected URLs (301 or 302) and dead URLs (404 or 410) do not belong in a sitemap. List the final destination, not a hop.
- Crawlable. The URL must not be blocked by robots.txt. Listing a blocked URL is another contradiction the crawler cannot resolve. If you are unsure how robots rules interact, see our complete robots.txt reference.
Things that should usually stay out: paginated archive pages you do not want indexed, internal search results, faceted or filtered parameter URLs, thank-you and cart pages, staging URLs, and anything tagged noindex. A common silent failure is the soft 404: a URL that returns 200 to a browser but serves an empty or near-empty page that the engine treats as missing. Those URLs waste crawl attention and pollute your submitted-versus-indexed ratio.
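The four tests above can be sketched as a simple filter that runs before the sitemap is generated. This is a minimal illustration, not a crawler: the Page record and its fields are hypothetical and assume you already have crawl data (status code, canonical tag, noindex flag, robots.txt verdict) for each URL.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical crawl record; field names are assumptions for this sketch."""
    url: str              # URL as crawled
    status: int           # HTTP status code returned
    canonical: str        # value of the page's rel=canonical tag
    noindex: bool         # True if meta robots or X-Robots-Tag says noindex
    robots_blocked: bool  # True if robots.txt disallows the URL


def sitemap_eligible(page: Page) -> bool:
    """Return True only if the page passes all four tests."""
    return (
        page.status == 200                # live, not redirected or dead
        and page.canonical == page.url    # self-canonical
        and not page.noindex              # indexable
        and not page.robots_blocked       # crawlable
    )


pages = [
    Page("https://example.com/", 200, "https://example.com/", False, False),
    Page("https://example.com/?ref=nav", 200, "https://example.com/", False, False),
    Page("https://example.com/old", 301, "https://example.com/new", False, False),
    Page("https://example.com/thanks", 200, "https://example.com/thanks", True, False),
    Page("https://example.com/search", 200, "https://example.com/search", False, True),
]

eligible = [p.url for p in pages if sitemap_eligible(p)]
print(eligible)  # ['https://example.com/']
```

Only the homepage survives: the parameter variant fails the canonical test, the redirect fails the status test, the thank-you page is noindexed, and the search page is blocked. A status check alone will not catch soft 404s, so those still need a content-level check.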
lastmod done right
The lastmod element tells crawlers when a URL last changed in a meaningful way. Google has confirmed it uses lastmod as a signal for scheduling recrawls of URLs it already knows, so an accurate value can speed up how quickly real updates are picked up.
The catch is the word accurate. Google only trusts lastmod when it is consistently and verifiably correct. If every URL shows today's date on every crawl, or the date never matches actual content changes, the engine learns to ignore the field on your site. The biggest lastmod mistake is a CMS or plugin that resets the date site-wide on every publish or rebuild, which makes the whole sitemap look freshly changed when nothing changed.
Use the W3C datetime format. A date such as 2026-06-04 is valid, and a full timestamp with timezone such as 2026-06-04T09:30:00+00:00 is also valid. Set the value only when the page content actually changes, not when an ad rotates or a sidebar widget updates.
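One way to keep lastmod honest is to tie it to a fingerprint of the main content rather than to the publish or rebuild time. The sketch below assumes you can isolate the meaningful content of a page (without ads or sidebar widgets) and store a hash next to the last recorded date. The function names are illustrative, not part of any CMS.

```python
import hashlib
from datetime import datetime, timezone


def content_fingerprint(main_content: str) -> str:
    """Hash only the meaningful content, not ads or widgets."""
    return hashlib.sha256(main_content.encode("utf-8")).hexdigest()


def next_lastmod(stored_hash, stored_lastmod, main_content, now):
    """Return (hash, lastmod). lastmod moves only when the content hash changes."""
    new_hash = content_fingerprint(main_content)
    if new_hash == stored_hash:
        return stored_hash, stored_lastmod  # nothing changed, keep the old date
    # W3C datetime with timezone offset, seconds precision
    return new_hash, now.replace(microsecond=0).isoformat()


first_build = datetime(2026, 6, 4, 9, 30, tzinfo=timezone.utc)
rebuild = datetime(2026, 6, 5, 12, 0, tzinfo=timezone.utc)

h1, lm1 = next_lastmod(None, None, "<p>First version</p>", first_build)
h2, lm2 = next_lastmod(h1, lm1, "<p>First version</p>", rebuild)   # rebuild, no edit
h3, lm3 = next_lastmod(h2, lm2, "<p>Second version</p>", rebuild)  # real edit

print(lm1)  # 2026-06-04T09:30:00+00:00
print(lm2)  # 2026-06-04T09:30:00+00:00
print(lm3)  # 2026-06-05T12:00:00+00:00
```

The rebuild with unchanged content keeps the original date, which is the behavior a site-wide "reset on publish" plugin gets wrong.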
Two optional elements deserve a clear verdict. Google does not use changefreq or priority at all. The priority value is too subjective to be reliable, and changefreq overlaps with lastmod. You can safely omit both. They add bytes and maintenance for zero benefit on Google, and most other engines treat them as hints at best.
Sitemap index files and the official limits
Every sitemap format limits a single file to 50,000 URLs or 50MB uncompressed, whichever comes first. These are the official limits published by Google and by the Sitemaps protocol. If you cross either threshold, split into multiple sitemap files and tie them together with a sitemap index.
A sitemap index is a sitemap of sitemaps. It is itself bound by the same limits: up to 50,000 child sitemaps and 50MB uncompressed. A single index can therefore reference up to 2.5 billion URLs (50,000 files of 50,000 URLs each) from one entry point. An index should list only sitemap files, not other index files; Search Console reports nested indexes as an error. Compression with gzip is allowed and reduces transfer size, but the 50MB cap applies to the uncompressed file, so gzip does not let you pack more URLs into a single file.
Here is a minimal sitemap index that points to two segmented sitemaps:
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-posts.xml</loc>
    <lastmod>2026-06-04</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-pages.xml</loc>
    <lastmod>2026-05-28</lastmod>
  </sitemap>
</sitemapindex>
And here is a single URL entry inside one of those child files, stripped to the elements that matter:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/blog/xml-sitemaps/</loc>
    <lastmod>2026-06-04</lastmod>
  </url>
</urlset>
Notice what is missing: no changefreq, no priority. That is intentional. URLs must be fully qualified and, by default, use the same protocol and host as the sitemap file itself. The entire file should be UTF-8 encoded, with special characters such as & escaped as XML entities.
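The splitting rule and the index file can be sketched in a few lines of Python using only the standard library. This is a minimal illustration under stated assumptions: the sitemap-N.xml naming and the base URL are made up for the example, and every entry is a (loc, lastmod) pair you have already vetted.

```python
import gzip
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000
MAX_BYTES = 50 * 1024 * 1024  # 52,428,800 bytes, measured uncompressed


def build_urlset(entries):
    """Serialize (loc, lastmod) pairs as one sitemap file, returned as bytes."""
    root = ET.Element("urlset", xmlns=NS)
    for loc, lastmod in entries:
        url = ET.SubElement(root, "url")
        ET.SubElement(url, "loc").text = loc  # ElementTree escapes &, < and >
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


def build_index(child_locs):
    """Serialize a sitemap index that points at the child sitemap URLs."""
    root = ET.Element("sitemapindex", xmlns=NS)
    for loc in child_locs:
        sitemap = ET.SubElement(root, "sitemap")
        ET.SubElement(sitemap, "loc").text = loc
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


def build_all(entries, base_url):
    """Split entries into files of at most MAX_URLS and build the index."""
    files = {}
    for start in range(0, len(entries), MAX_URLS):
        name = f"sitemap-{start // MAX_URLS + 1}.xml"  # hypothetical naming
        data = build_urlset(entries[start:start + MAX_URLS])
        if len(data) > MAX_BYTES:
            raise ValueError(f"{name} exceeds 50MB uncompressed; use smaller chunks")
        files[name] = data
    index = build_index([base_url + name for name in files])
    return files, index


entries = [(f"https://example.com/p/{i}/", "2026-06-04") for i in range(120_000)]
files, index = build_all(entries, "https://example.com/")
print(sorted(files))  # ['sitemap-1.xml', 'sitemap-2.xml', 'sitemap-3.xml']

# gzip shrinks the transfer, but the limit check above used the raw size.
compressed = gzip.compress(files["sitemap-1.xml"])
print(len(files["sitemap-1.xml"]), len(compressed))
```

With 120,000 URLs the sketch produces three child files plus one index. The size check runs on the uncompressed bytes because that is where the 50MB cap applies.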
Special sitemaps: image, video, and news
Beyond the standard format there are extensions for specific content types, each adding its own namespace and tags inside the same URL entries.
- Image sitemaps. You can add image information to a URL entry so Google can discover images that are hard to find through normal crawling, for example images loaded by script. Each URL can reference multiple images. This is optional, and for most sites well-built pages already expose their images.
- Video sitemaps. These describe video content on a page, including title, description, thumbnail, and duration, which helps videos surface in search. Useful for sites where video is a primary asset.
- News sitemaps. For publishers in Google News, a news sitemap lists articles published in the last two days and is meant to be small and frequently updated. It is a specialized tool, not something a typical site needs.
Most sites do not need these. Add a special sitemap only when that content type is genuinely important to your visibility and the standard crawl is not surfacing it well.
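For a sense of how an extension sits inside a normal URL entry, here is a minimal sketch that emits an image sitemap entry with Python's standard library. The namespace is Google's image sitemap namespace. The page and image URLs are placeholders, and only the required image:loc tag is shown.

```python
import xml.etree.ElementTree as ET

SM_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMG_NS = "http://www.google.com/schemas/sitemap-image/1.1"

# Serialize the sitemap namespace as the default and the image one as "image:".
ET.register_namespace("", SM_NS)
ET.register_namespace("image", IMG_NS)


def url_with_images(page_url, image_urls):
    """Build one <url> entry that references several images on the page."""
    url = ET.Element(f"{{{SM_NS}}}url")
    ET.SubElement(url, f"{{{SM_NS}}}loc").text = page_url
    for img in image_urls:
        image = ET.SubElement(url, f"{{{IMG_NS}}}image")
        ET.SubElement(image, f"{{{IMG_NS}}}loc").text = img
    return url


urlset = ET.Element(f"{{{SM_NS}}}urlset")
urlset.append(
    url_with_images(
        "https://example.com/gallery/",
        ["https://example.com/img/a.jpg", "https://example.com/img/b.jpg"],
    )
)

xml_text = ET.tostring(urlset, encoding="unicode")
print(xml_text)
```

The output is an ordinary urlset with one extra namespace declaration. The image tags live inside the same url element as the page's loc, which is why the extensions can be mixed into a standard sitemap instead of needing a separate file.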
How to submit and monitor
There are two reliable ways to tell engines where your sitemap lives, and one method that no longer works.
First, reference the sitemap in your robots.txt with a Sitemap: line containing the full URL. Any compliant crawler reading robots.txt will find it. You can list multiple sitemap or sitemap-index URLs.
Second, submit the sitemap in Google Search Console (and Bing Webmaster Tools). This is the method that gives you reporting: how many URLs were submitted, how many discovered, when the file was last read, and any parse errors. Treat the Search Console Sitemaps report as a monitoring dashboard, not a one-time chore.
The method that no longer works is the ping endpoint. Google deprecated the unauthenticated HTTP ping in 2023, and requests to it now return a 404. Bing retired its equivalent as well. Google found that unauthenticated pings were dominated by spam and added little value. If you have old code or a plugin still pinging that URL, it will not harm you, but it does nothing, so stop relying on it. Submission via robots.txt and Search Console is the supported path.
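To confirm that robots.txt actually advertises your sitemaps, you can read it the way a compliant crawler would. The sketch below uses Python's standard urllib.robotparser on an inline example file. The rules and sitemap URLs are placeholders, and a real check would fetch your live robots.txt instead.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content; in practice, fetch this from your own site.
robots_txt = """\
User-agent: *
Disallow: /cart/

Sitemap: https://example.com/sitemap-index.xml
Sitemap: https://example.com/sitemap-news.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# site_maps() returns the Sitemap: URLs in file order, or None if there are none.
sitemaps = parser.site_maps()
print(sitemaps)

# The same parser can confirm that a URL you plan to list is not blocked.
print(parser.can_fetch("*", "https://example.com/blog/xml-sitemaps/"))  # True
print(parser.can_fetch("*", "https://example.com/cart/checkout"))       # False
```

Running the same can_fetch check over every URL in the sitemap is a cheap way to catch the "listed but blocked" contradiction before a crawler does.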
The AI era angle
AI crawlers from assistants and answer engines are reading the web at scale, and the well-behaved ones still respect the same open standards: robots.txt for permission and sitemaps for discovery. A clean, accurate, canonical-only sitemap is exactly as useful to a respectful AI crawler as it is to Googlebot, because it is the cheapest way to hand any agent a trustworthy map of your content. The work does not change. If you want to understand which agents are actually fetching your pages, our AI crawler map tracks the major ones. The takeaway: there is no separate "AI sitemap" to build. The honest sitemap you already maintain is the asset.
Common mistakes to avoid
- Listing noindexed URLs. The "find me but do not index me" contradiction. Remove noindexed pages from the sitemap.
- Listing redirected URLs. Always list the final 200 destination, never a 301 or 302 source.
- Listing 404 or 410 URLs. Dead URLs waste crawl attention and inflate error counts.
- Stale or fake lastmod. A date that never matches real changes teaches the engine to ignore the field across your whole site.
- Including non-canonical or parameter URLs. Mismatches between the sitemap URL and the canonical tag confuse consolidation.
- Including robots-blocked URLs. The crawler cannot fetch what you told it to ignore.
- Exceeding the limits. A file over 50,000 URLs or 50MB will be rejected or truncated. Split and index.
- Relative URLs or wrong host. Use absolute, fully qualified URLs on the correct protocol and domain.
- Treating the sitemap as an indexing lever. It is discovery, not a ranking or indexing guarantee.
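Several of these mistakes can be caught offline by inspecting the sitemap file itself. The following is a rough audit sketch, not a full validator: it checks the file limits, absolute URLs, protocol and host, and lastmod format. It makes no network requests, so it cannot detect noindex, redirects, or 404s. The function name and the simplified date pattern are assumptions for illustration.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Simplified W3C datetime check: YYYY, YYYY-MM, YYYY-MM-DD, or a full timestamp.
W3C_DATE = re.compile(
    r"^\d{4}(-\d{2}(-\d{2}(T\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:\d{2}))?)?)?$"
)


def audit_sitemap(xml_bytes, expected_scheme, expected_host):
    """Return a list of human-readable problems found in one sitemap file."""
    problems = []
    if len(xml_bytes) > 50 * 1024 * 1024:
        problems.append("file exceeds 50MB uncompressed")

    root = ET.fromstring(xml_bytes)
    urls = root.findall("sm:url", NS)
    if len(urls) > 50_000:
        problems.append(f"file lists {len(urls)} URLs, over the 50,000 limit")

    for url in urls:
        loc = (url.findtext("sm:loc", default="", namespaces=NS) or "").strip()
        parts = urlsplit(loc)
        if not parts.scheme or not parts.netloc:
            problems.append(f"relative or malformed URL: {loc!r}")
        elif (parts.scheme, parts.netloc) != (expected_scheme, expected_host):
            problems.append(f"wrong protocol or host: {loc}")

        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        if lastmod is not None and not W3C_DATE.match(lastmod.strip()):
            problems.append(f"lastmod not in W3C datetime format: {lastmod!r}")
    return problems


sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/blog/xml-sitemaps/</loc><lastmod>2026-06-04</lastmod></url>
  <url><loc>/about/</loc></url>
  <url><loc>http://example.com/pricing/</loc><lastmod>04/06/2026</lastmod></url>
</urlset>"""

problems = audit_sitemap(sample, "https", "example.com")
for problem in problems:
    print(problem)
```

The sample file triggers three findings: one relative URL, one URL on the wrong protocol, and one lastmod in a non-W3C format. Status, noindex, and canonical checks need live responses and belong in a separate crawl step.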
FAQ
Does an XML sitemap improve my rankings?
No. A sitemap aids discovery and recrawl scheduling. It does not raise rankings and does not guarantee indexing. Rankings depend on content quality, relevance, and authority.
Do I still need a sitemap if my site is small and well linked?
It rarely hurts and often helps. For a small, well-linked site the benefit is modest, but the Search Console reporting alone justifies keeping one.
Should I include changefreq and priority?
No. Google ignores both. You can omit them entirely and lose nothing on Google, and most other engines treat them as weak hints at most.
What are the exact size limits?
50,000 URLs or 50MB uncompressed per file, whichever comes first. Cross either threshold and you split into multiple files joined by a sitemap index, which is bound by the same limits.
How do I submit my sitemap now that ping is gone?
Reference it in robots.txt with a Sitemap line and submit it in Google Search Console and Bing Webmaster Tools. The old HTTP ping endpoint was deprecated in 2023 and returns 404.
Can I gzip my sitemap to fit more URLs?
You can gzip to reduce transfer size, but the 50MB cap applies to the uncompressed file, so compression does not let you pack more URLs into one file.
Claude Vincent is a technical SEO consultant focused on crawlability, rendering, and AI-search visibility. He writes the field guides and case studies at SEO ProCheck, with a bias toward the durable, unglamorous work that decides whether search engines and AI answer engines can actually read and cite a site.