Orphan Pages: How to Find and Fix Pages With No Internal Links

November 17, 2023
Technical SEO


An orphan page is a URL that exists and can be served, but has zero internal links pointing to it from anywhere on your site. Search engines can still find these pages through your sitemap or external links, but with no internal paths feeding them, they receive almost no crawl priority and almost no link equity. The result is a page that quietly underperforms or never gets indexed at all, and most teams have no idea how many they have.

Why Orphan Pages Hurt You

Internal links do two jobs: they tell crawlers a page exists and matters, and they pass authority through your site's structure. A page cut off from that network loses both signals. The practical consequences:

Crawl starvation. Googlebot prioritizes URLs it discovers through links. Sitemap-only discovery is a weaker signal, so orphans get crawled rarely or not at all.
Diluted authority. A page with no inbound internal links sits at the bottom of your PageRank distribution, even if the content is strong.
Index bloat and confusion. Old orphans (expired campaigns, deprecated products, staging leftovers) clutter your index and can trigger thin-content or duplicate signals.
Hidden revenue. A genuinely valuable page nobody links to is wasted inventory. Connecting it often produces fast ranking gains because the content already exists.

The Core Method: Three Datasets, Two Diffs

You cannot find orphans from a single source. A crawler that starts at your homepage and follows links will never reach an orphan by definition, so the crawl alone can't surface them. The reliable approach is to assemble three independent inventories of your URLs and compare them.

The crawl set: every URL reachable by following internal links from your homepage. Run a link-following crawl with Screaming Frog, Sitebulb, or a similar tool, starting from the root with sitemap crawling disabled so you measure pure link reachability.
The sitemap set: every URL you've declared in your XML sitemaps. Export these directly, or have your crawler ingest the sitemap as a separate list.
The known-URL set: every URL that actually receives traffic or gets crawled, pulled from server access logs and/or Google Search Console (the Pages report and the URL Inspection API). Analytics landing-page exports work as a supplement.
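If you'd rather build the sitemap set yourself than export it from a tool, a minimal sketch in Python looks like this. It assumes the sitemap XML is already in memory as a string and uses the standard sitemaps.org namespace; the function name sitemap_locs and the sample URLs are illustrative, not part of any tool mentioned here.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_locs(xml_text: str) -> list[str]:
    """Return every <loc> value found in a sitemap (or sitemap index) document."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter(f"{SITEMAP_NS}loc") if el.text]


# Hypothetical sample data for illustration only.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc> https://example.com/pricing </loc></url>
</urlset>"""

sitemap_urls = sitemap_locs(sample)
```

For a sitemap index, the same function returns the child sitemap URLs, which you would then fetch and parse in turn.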

The orphans live in the gaps between these sets. Two diffs do the work:

Sitemap minus Crawl. URLs in your sitemap that the link-crawl never reached. These are your cleanest orphan candidates: you've told Google they exist, but your own navigation doesn't.
Logs/GSC minus Crawl. URLs that get crawled or earn impressions but aren't reachable by links. This catches orphans that aren't even in your sitemap, often the most neglected pages on the site.
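The two diffs are plain set subtraction. The sketch below shows the idea in Python, assuming the three inventories are already loaded and normalized; the function name orphan_candidates, the dictionary keys, and the example URLs are all hypothetical.

```python
def orphan_candidates(crawl, sitemap, known):
    """Compare three URL inventories and return orphan candidates.

    crawl   -- URLs reached by following internal links
    sitemap -- URLs declared in XML sitemaps
    known   -- URLs seen in logs, GSC, or analytics
    """
    crawl, sitemap, known = set(crawl), set(sitemap), set(known)
    return {
        # Diff 1: declared in the sitemap, never reached by links.
        "sitemap_minus_crawl": sorted(sitemap - crawl),
        # Diff 2: crawled or earning impressions, never reached by links.
        "known_minus_crawl": sorted(known - crawl),
        # Subset of diff 2 that is also missing from the sitemap.
        "not_in_sitemap_either": sorted(known - crawl - sitemap),
    }


# Hypothetical inventories for illustration only.
crawl = ["https://example.com/", "https://example.com/blog", "https://example.com/pricing"]
sitemap = crawl + ["https://example.com/old-campaign"]
known = [
    "https://example.com/",
    "https://example.com/blog",
    "https://example.com/old-campaign",
    "https://example.com/landing-2021",
]

result = orphan_candidates(crawl, sitemap, known)
```

The third key is a convenience rather than a separate diff: it isolates the pages that neither your navigation nor your sitemap acknowledges.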

Running the Diff in Practice

Once you have three URL lists exported to CSV, normalize them first: lowercase hosts, strip trailing slashes consistently, drop tracking parameters, and resolve protocol/www variants to one canonical form. Skipping normalization produces dozens of false positives where the "same" URL appears in different shapes across sources.
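As a rough sketch of that normalization step in Python, using only the standard library: the tracking-parameter list, the choice of https without www as the canonical form, and the function name normalize_url are assumptions you should adapt to your own site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters; extend this for your own campaigns.
TRACKING_PREFIXES = ("utm_",)
TRACKING_PARAMS = {"gclid", "fbclid"}


def normalize_url(url: str, scheme: str = "https", strip_www: bool = True) -> str:
    """Map URL variants to one canonical shape so set comparisons line up."""
    parts = urlsplit(url.strip())
    # hostname is already lowercased; ports and credentials are dropped here,
    # which assumes the site runs on default ports.
    host = parts.hostname or ""
    if strip_www and host.startswith("www."):
        host = host[4:]
    # Strip trailing slashes consistently, but keep the root path as "/".
    path = parts.path.rstrip("/") or "/"
    # Drop tracking parameters, keep everything else in its original order.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
        and key.lower() not in TRACKING_PARAMS
    ]
    # The fragment is discarded because it never reaches the server.
    return urlunsplit((scheme, host, path, urlencode(kept), ""))


example = normalize_url("HTTP://WWW.Example.com/Blog/?utm_source=x&page=2#top")
```

Path case is left untouched on purpose, since paths can be case-sensitive on many servers. Run every list through the same function before diffing.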

Then a simple set comparison surfaces the candidates. On the command line:

comm -23 sitemap_urls.txt crawl_urls.txt returns URLs in the sitemap but not the crawl (both files sorted with sort -u first).
Repeat with your logs/GSC export in place of the sitemap to catch the second diff.

Screaming Frog automates much of this: connect the GSC and Google Analytics APIs, supply your sitemap, run the crawl, then use the Orphan Pages report under the Reports menu (in current versions it populates after you run Crawl Analysis). It flags URLs found in connected sources but not in the crawl. Always treat its output as candidates, not verdicts: verify each before acting.

Triage: Link, Redirect, or Remove

Every confirmed orphan resolves to one of three decisions. Pull each candidate's status code, indexation state, impressions/clicks, and topical relevance, then route it:

Link it when the page is valuable, indexable (200, canonical to self, not noindexed), and topically aligned with existing content. Add 2 to 5 contextual internal links from relevant, authoritative pages: hub pages, related articles, category pages. A link from a high-traffic related page beats ten links from footers. Make sure the anchor text is descriptive, not "click here."
Redirect it when the content is outdated or duplicates a stronger page, but the URL has accrued backlinks or traffic history, or covers a query you still serve elsewhere. 301 it to the closest equivalent live page so any residual equity is preserved. Never blanket-redirect orphans to the homepage; Google tends to treat irrelevant homepage redirects as soft 404s.
Remove it when the page has no value, no links, no traffic, and no business reason to exist (expired events, test pages, thin auto-generated URLs). Return a 410 Gone for clean removal, and pull it from your sitemap. If it has zero external signals, removal is cheaper than maintaining a redirect forever.

A quick decision heuristic: does this page earn impressions or backlinks? If yes, link or redirect; never delete. If no, ask "would I write this page today?" If yes, link it and improve it. If no, remove it.
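The heuristic is simple enough to encode as a first-pass filter over a spreadsheet export. The Python sketch below is one way to express it, not a substitute for judgment: the OrphanCandidate fields are hypothetical, and outdated_or_duplicate and would_write_today are human calls you record per URL rather than anything a tool reports.

```python
from dataclasses import dataclass


@dataclass
class OrphanCandidate:
    url: str
    impressions: int             # e.g. from a GSC export
    backlinks: int               # e.g. from a backlink tool export
    outdated_or_duplicate: bool  # editorial judgment, recorded by a person
    would_write_today: bool      # editorial judgment, recorded by a person


def triage(page: OrphanCandidate) -> str:
    """Return 'link', 'redirect', or 'remove' for a confirmed orphan."""
    if page.impressions > 0 or page.backlinks > 0:
        # The page earns signals: link or redirect, never delete.
        return "redirect" if page.outdated_or_duplicate else "link"
    # No external signals: keep it only if it is worth writing today.
    return "link" if page.would_write_today else "remove"


# Hypothetical candidates for illustration only.
guide = OrphanCandidate("https://example.com/guide", 120, 3, False, True)
old_offer = OrphanCandidate("https://example.com/old-offer", 0, 5, True, False)
test_page = OrphanCandidate("https://example.com/test-page", 0, 0, False, False)
```

Treat the output as a sorted to-do list. Status code and indexation checks from the triage criteria above still need to happen before you act on any row.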

Common Mistakes

Trusting a crawler's orphan report blindly. If your sitemap is stale or your GSC connection is partial, the tool will mislabel pages. Confirm reachability manually on a sample before bulk action.
Counting paginated, faceted, or parameter URLs as orphans. These are often intentionally unlinked or canonicalized. Filter them out before triage so you don't waste effort.
Ignoring JavaScript-rendered links. If your navigation builds links client-side and you crawled in text-only mode, every page behind that nav looks orphaned. Enable JavaScript rendering in the crawl, or you'll chase phantoms.
Fixing orphans once. New orphans appear constantly: unpublished-then-republished posts, migrated URLs, CMS quirks. Re-run the diff quarterly, or after any migration or large content push.
Linking from low-value locations. Dumping orphans into a sitewide footer link block technically de-orphans them but passes negligible relevance. Use in-content, contextual links.
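To keep paginated, faceted, and parameter URLs out of triage, a small pre-filter helps. The Python sketch below is a starting point under loud assumptions: the /page/N path pattern and the parameter names in FACET_PARAMS are hypothetical examples, so replace them with whatever your CMS actually generates.

```python
import re
from urllib.parse import parse_qsl, urlsplit

# Assumed pagination pattern, e.g. /blog/page/3 or /blog/page/3/
PAGINATION_PATH = re.compile(r"/page/\d+/?$")
# Assumed facet and sort parameters; adjust to your own site.
FACET_PARAMS = {"page", "sort", "filter", "color", "size"}


def is_probably_intentional(url: str) -> bool:
    """Flag URLs that are likely unlinked on purpose, so triage can skip them."""
    parts = urlsplit(url)
    if PAGINATION_PATH.search(parts.path):
        return True
    keys = {key.lower() for key, _ in parse_qsl(parts.query, keep_blank_values=True)}
    return bool(keys & FACET_PARAMS)


# Hypothetical candidate list for illustration only.
candidates = [
    "https://example.com/blog/page/3/",
    "https://example.com/shoes?color=red&size=9",
    "https://example.com/guides/orphan-pages",
]
to_triage = [url for url in candidates if not is_probably_intentional(url)]
```

Spot-check what the filter discards before trusting it, since a genuinely valuable page can sit behind a parameter.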

FAQ

Are orphan pages a Google penalty? No. There's no penalty for having them. The harm is indirect: poor crawling, weak authority, and index clutter. But large volumes of thin orphans can contribute to site-quality signals that affect how the whole site is assessed.

Can a page be indexed and orphaned at the same time? Yes, frequently. Google can index a URL it found via sitemap or backlink even with no internal links. Indexed-but-orphaned pages are the highest-value fixes because linking them often unlocks rankings immediately.

How many orphans is normal? Small sites should have nearly zero. Large sites with years of content commonly carry hundreds. The number matters less than the trend: it should fall after each cleanup and stay flat, not climb.

Do noindexed pages count? Functionally no. If a page is intentionally noindexed, being unlinked is usually fine. Focus your triage on indexable, canonical, 200-status URLs that you actually want in search.

The discipline here is the repeatable diff, not any single tool. Once you can reliably produce three URL inventories and compare them, finding these pages becomes a 30-minute task you run on schedule rather than a fire drill after traffic drops.


Claude Vincent

Claude Vincent is a technical SEO consultant focused on crawlability, rendering, and AI-search visibility. He writes the field guides and case studies at SEO ProCheck, with a bias toward the durable, unglamorous work that decides whether search engines and AI answer engines can actually read and cite a site.
