Designing a Crawlable URL and Folder Architecture for Sites Over 10,000 Pages

September 9, 2024
Technical SEO

Once a site crosses roughly 10,000 URLs, crawl budget and internal link equity stop being abstractions and start dictating which pages rank. At that scale, Google will not crawl everything on every visit, and pages buried deep in your folder tree quietly fall out of the index. Getting the architecture right is the difference between authority concentrating on your money pages and leaking into a sprawl of orphaned, rarely crawled URLs.

Why click depth is the metric that actually matters

Click depth is the minimum number of clicks from your homepage to reach a given page, following internal links. It is not the same as URL folder depth. A page can live at /category/subcategory/product/ and still be one click from the homepage if a hub page links straight to it. Google's crawlers approximate importance partly through how easily a page is reached, so depth is a strong proxy for how often a URL gets recrawled and how much internal PageRank it accumulates.

The working rule for large sites: nothing that needs to rank should sit more than three clicks from the homepage. Three clicks is not a Google-published threshold, but it is a practical ceiling that keeps important pages in the frequently-crawled tier. Beyond four or five clicks, you will routinely see "Discovered - currently not indexed" and "Crawled - currently not indexed" in Search Console, and recrawl intervals stretch into weeks.

Crawl your own site and pull the crawl-depth distribution before changing anything. If you find that 40% of your indexable URLs sit at depth five or deeper, you have your priority list. The goal of the architecture work below is to flatten that distribution.
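
If your crawler exports a list of internal links, the depth distribution can be computed with a breadth-first search from the homepage. The sketch below assumes the link graph is already available as a dictionary; the URLs and the `click_depths` helper are hypothetical and only illustrate the calculation.

```python
from collections import Counter, deque

def click_depths(links, home):
    """Breadth-first search from the homepage; returns {url: minimum clicks}."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:  # the first visit in BFS is the shortest path
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

# Hypothetical link graph: page -> pages it links to.
links = {
    "/": ["/shoes/", "/guides/"],
    "/shoes/": ["/shoes/trail/", "/shoes/road/"],
    "/shoes/trail/": ["/shoes/trail/model-a/"],
    "/guides/": ["/shoes/trail/model-a/"],
}

depths = click_depths(links, "/")
distribution = Counter(depths.values())
for depth in sorted(distribution):
    print(depth, distribution[depth])
```

Any URL that never appears in `depths` is unreachable by internal links from the homepage, which is worth listing separately from the deep pages.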

Separate folder structure from link structure

These are two different decisions, and conflating them is the most common architectural mistake.

Folder structure (the URL path) is for humans, topical signaling, and reporting. Clean, shallow, keyword-relevant paths like /running-shoes/trail/ help users and let you segment performance by folder in Search Console and analytics.
Link structure is what determines click depth and how authority flows. You control it with navigation, hub pages, breadcrumbs, and contextual links, independent of where the file lives in the URL.

A deep URL path does not hurt rankings on its own. A deep click path does. You can have /guides/2026/spring/trail-running/shoe-fit/ and still link to it directly from a hub, keeping it at depth two. Design the folder taxonomy for clarity and reporting; design the link graph to keep depth low.
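
To see the two measures side by side, compare the number of path segments in each URL with the click depth from a crawl. This is a minimal sketch: `folder_depth` is a hypothetical helper, and the click depths are made-up values standing in for a crawl export.

```python
from urllib.parse import urlparse

def folder_depth(url):
    """Count the non-empty path segments in a URL."""
    return len([segment for segment in urlparse(url).path.split("/") if segment])

# Hypothetical click depths, e.g. taken from a crawl export.
click_depth = {
    "/guides/2026/spring/trail-running/shoe-fit/": 2,
    "/sale/old-model/": 6,
}

for url, clicks in click_depth.items():
    print(url, "folders:", folder_depth(url), "clicks:", clicks)
```

In this made-up data the five-folder guide is the healthy page and the two-folder sale page is the one that needs links.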

The hub-and-spoke model for flowing authority to money pages

At 10,000+ pages you cannot link everything to everything. The scalable pattern is a hub-and-spoke (sometimes called pillar-cluster) hierarchy:

Homepage links to a small set of top-level hubs (your main categories or topic pillars), typically 8 to 30 links, not hundreds.
Hub pages are the workhorses. Each hub links to its sub-hubs and to the highest-value "money" pages in that cluster. These are the pages that should rank for your most commercially valuable head terms, so they should receive the most internal links.
Spoke pages (individual products, articles, or detail pages) link back up to their hub and laterally to closely related siblings.

The key insight for revenue: internal links are how you tell Google which pages matter most. If your money pages only receive links from a handful of deep spokes, they will underperform regardless of their content quality. Audit how many internal links each money page receives and make sure your highest-converting pages are also your most internally-linked pages. This is the lever most teams ignore.

Practically, this means your hub pages should not be thin category shells. Give them curated links to top products, supporting guides, and related sub-categories. A strong hub is a routing layer that pushes equity downward to spokes and concentrates it on the money pages you choose to feature.
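
The internal-link audit can be sketched in a few lines once you have the link graph. The example below counts distinct linking pages per URL and flags money pages at or below the site median. The graph, the `money_pages` list, and the median cutoff are all illustrative assumptions, not a fixed rule.

```python
from collections import Counter
from statistics import median

def inbound_counts(links):
    """Count distinct linking pages per target URL, ignoring self-links."""
    counts = Counter({page: 0 for page in links})
    for source, targets in links.items():
        for target in set(targets):
            if target != source:
                counts[target] += 1
    return counts

# Hypothetical link graph: page -> pages it links to.
links = {
    "/": ["/shoes/", "/blog/"],
    "/shoes/": ["/shoes/trail/", "/blog/"],
    "/shoes/trail/": ["/shoes/", "/blog/"],
    "/blog/": ["/blog/tag/news/", "/shoes/"],
    "/blog/tag/news/": ["/blog/"],
}
money_pages = ["/shoes/trail/"]  # hypothetical: pages you need to rank

counts = inbound_counts(links)
typical = median(counts.values())
underlinked = [page for page in money_pages if counts[page] <= typical]

print(counts.most_common(3))
print("median:", typical, "underlinked money pages:", underlinked)
```

Here the blog hub collects four inbound links while the money page collects one, which is the imbalance the audit is meant to surface.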

Flattening depth at scale: the techniques that work

On a small site you flatten depth by hand. On a large site you need systematic mechanisms:

Paginated and faceted hubs: Category pages with pagination push later items deeper. Surface your best spokes (best-sellers, newest, highest-margin) on page one of every hub so they stay shallow, and rely on XML sitemaps plus contextual links for the long tail.
Curated "featured" modules: Hand-pick or algorithmically rotate a block of high-value pages onto hubs and the homepage to pull specific URLs up to depth one or two.
Related-item links: "Related products" or "related articles" blocks create lateral links between spokes, shortening paths and distributing equity across a cluster.
HTML breadcrumbs: Breadcrumbs create consistent upward links and reinforce the folder hierarchy as a link hierarchy. Marked up with BreadcrumbList structured data, they can also make pages eligible for breadcrumb rich results.
Topic/index pages for the long tail: When you have thousands of detail pages, build intermediate index hubs (e.g., by attribute, location, or tag) so no single page links to ten thousand others, and no detail page sits more than three clicks down.
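
The arithmetic behind intermediate index hubs is worth checking before you build them. As a rough sketch, assume an idealized tree in which every page links to the same number of distinct pages one level down; the `min_depth` helper below is hypothetical and ignores navigation, cross-links, and duplicate targets.

```python
def min_depth(total_pages, links_per_page):
    """Smallest depth d such that links_per_page ** d >= total_pages."""
    if links_per_page < 2:
        raise ValueError("links_per_page must be at least 2")
    depth, reachable = 0, 1
    while reachable < total_pages:
        reachable *= links_per_page  # pages reachable at the next level down
        depth += 1
    return depth

for fan_out in (25, 50, 100):
    print(fan_out, "links per page ->", min_depth(10_000, fan_out), "clicks")
```

Under these assumptions, 10,000 detail pages fit within three clicks at 25 or 50 links per page and within two clicks at 100, so the three-click ceiling does not require link-stuffed pages.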

What does not scale: giant mega-menus that link to every category, or footers stuffed with hundreds of links. These dilute the value of each link and can make every page look equally (un)important. Keep global navigation focused on true top-level hubs.

Crawl budget and the indexable URL count

The fastest way to improve crawling of your important pages is to stop wasting crawl budget on pages that should not be crawled at all. On large sites, faceted navigation and parameters can multiply 10,000 real pages into millions of crawlable URL combinations.
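
A quick calculation shows how fast filters multiply. The sketch below assumes each facet is either unset or set to exactly one value; the facet names, value counts, and category count are invented for illustration.

```python
from math import prod

# Hypothetical facets on one category page: name -> number of selectable values.
facets = {"color": 12, "size": 10, "brand": 40, "price_band": 6}
categories = 50  # hypothetical count of category pages carrying these filters

# Each facet is either unset or set to exactly one value, hence (values + 1).
# The product includes the unfiltered category page itself.
per_category = prod(count + 1 for count in facets.values())
total = per_category * categories

print(per_category)  # 41041
print(total)         # 2052050
```

Multi-select filters, sort orders, and pagination parameters would push the total higher still.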

Block low-value parameter and filter combinations from crawling via robots.txt where appropriate, and avoid linking to them in HTML.
Use rel="canonical" to consolidate near-duplicates, but remember that canonicals do not save crawl budget: Google still has to crawl the duplicate to see the canonical tag. Not linking to junk URLs is what saves budget.
Keep XML sitemaps clean: only indexable, canonical, 200-status URLs, segmented by section so you can monitor indexation rates per folder in Search Console.
Fix redirect chains and large blocks of 404/soft-404 pages, because they consume crawl capacity that should go to your real inventory.
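
The sitemap rule above can be expressed as a simple filter over crawl data. This sketch assumes each crawled page is a record with `url`, `status`, `indexable`, and `canonical` fields. Those field names and the `sitemap_segments` helper are hypothetical, so map them to whatever your crawler exports.

```python
from collections import defaultdict
from urllib.parse import urlparse

def sitemap_segments(pages):
    """Group sitemap-eligible URLs by their first path segment."""
    segments = defaultdict(list)
    for page in pages:
        eligible = (
            page["status"] == 200
            and page["indexable"]
            and page["canonical"] == page["url"]  # self-canonical only
        )
        if eligible:
            parts = [p for p in urlparse(page["url"]).path.split("/") if p]
            segments[parts[0] if parts else "root"].append(page["url"])
    return dict(segments)

# Hypothetical crawl records.
pages = [
    {"url": "https://example.com/shoes/trail/", "status": 200, "indexable": True,
     "canonical": "https://example.com/shoes/trail/"},
    {"url": "https://example.com/shoes/trail/?sort=price", "status": 200, "indexable": True,
     "canonical": "https://example.com/shoes/trail/"},
    {"url": "https://example.com/guides/shoe-fit/", "status": 200, "indexable": True,
     "canonical": "https://example.com/guides/shoe-fit/"},
    {"url": "https://example.com/old-page/", "status": 301, "indexable": False,
     "canonical": "https://example.com/old-page/"},
]

segments = sitemap_segments(pages)
for section, urls in segments.items():
    print(section, len(urls))
```

Writing one sitemap file per key in `segments` gives you the per-folder indexation view in Search Console.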

Common mistakes

Treating URL depth as the problem. A long path is fine; a long click path is not. Fix linking, not just slugs.
Money pages with the fewest internal links. Teams optimize content and ignore that the page receives three internal links while a blog tag page receives four hundred.
Orphan pages from migrations or product feeds. These are pages that sit in the sitemap but are linked from nowhere, and they rarely rank. Crawl and reconcile sitemap URLs against your internal link graph regularly.
Relying on XML sitemaps for discovery. A sitemap helps Google find URLs, but it does not communicate importance the way internal links do. A page in the sitemap with no internal links is still effectively orphaned.
Mega-menus as a depth fix. Linking everything from global nav does not flatten depth meaningfully and dilutes link signals. Use targeted hubs instead.
Letting faceted navigation explode the crawl space. Unmanaged filters can bury your real pages under millions of low-value combinations and starve them of crawl attention.
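
Reconciling the sitemap against the link graph is a set difference. The sketch below is one minimal way to do it, assuming you already have both lists; the `find_orphans` helper and the sample URLs are hypothetical.

```python
def find_orphans(sitemap_urls, links, home="/"):
    """Return sitemap URLs that no other page links to (the homepage is exempt)."""
    linked = {
        target
        for source, targets in links.items()
        for target in targets
        if target != source  # a self-link does not rescue an orphan
    }
    return sorted(set(sitemap_urls) - linked - {home})

# Hypothetical inputs: sitemap URLs and an internal link graph.
sitemap_urls = ["/", "/shoes/", "/shoes/trail/", "/feed-import/sku-123/"]
links = {
    "/": ["/shoes/"],
    "/shoes/": ["/", "/shoes/trail/"],
}

orphans = find_orphans(sitemap_urls, links)
print(orphans)
```

The feed-imported URL is in the sitemap but has no inbound link, so it is the one page reported.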

A practical implementation order

Crawl the site and export the click-depth distribution and internal-link counts per URL.
Identify money pages and confirm they sit at depth ≤3 and receive above-median internal links. Fix the ones that don't first, because that is where revenue moves.
Build or strengthen hub pages so each cluster has a clear routing layer.
Add related-item and featured modules to pull priority long-tail pages up.
Trim crawl waste: block junk parameters, clean sitemaps, fix redirect chains.
Recrawl, compare the depth distribution, and watch indexation and recrawl frequency in Search Console over the following weeks.

Architecture is not a one-time project at this scale. Every new product line, content batch, or template change can quietly push pages deeper or orphan them. Build the depth-and-link-count audit into a recurring crawl, and treat any important page that drifts past three clicks as a regression to fix.
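
A recurring check can be as simple as comparing two depth snapshots. The sketch below flags important URLs that are now deeper than three clicks or no longer reachable. The snapshot format (URL to click depth) and the `depth_regressions` helper are assumptions for illustration.

```python
def depth_regressions(previous, current, important, max_depth=3):
    """Return {url: (old_depth, new_depth)} for important URLs past max_depth.

    A new_depth of None means the latest crawl did not reach the URL at all.
    """
    regressions = {}
    for url in important:
        now = current.get(url)
        if now is None or now > max_depth:
            regressions[url] = (previous.get(url), now)
    return regressions

# Hypothetical snapshots from two crawls: url -> click depth.
previous = {"/shoes/trail/": 2, "/shoes/road/": 3, "/guides/shoe-fit/": 2}
current = {"/shoes/trail/": 2, "/shoes/road/": 5}
important = ["/shoes/trail/", "/shoes/road/", "/guides/shoe-fit/"]

regressions = depth_regressions(previous, current, important)
for url, (before, after) in regressions.items():
    print(url, before, "->", after)
```

Running this after each scheduled crawl turns the three-click rule into a check that can fail loudly instead of a one-off audit finding.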

Claude Vincent

Claude Vincent is a technical SEO consultant focused on crawlability, rendering, and AI-search visibility. He writes the field guides and case studies at SEO ProCheck, with a bias toward the durable, unglamorous work that decides whether search engines and AI answer engines can actually read and cite a site.
