Screaming Frog Crawl Recipes Every SEO Should Save


Most SEOs run Screaming Frog the same way every time: point it at a domain, hit start, and skim the tabs. That leaves its most powerful features (custom extraction, list mode, and surgical configuration) untouched. The recipes below are configurations you can build once, save, and reuse on every audit, turning a generic crawler into a precision diagnostic instrument.

Why save configurations instead of crawling fresh every time

Screaming Frog lets you export a full configuration as a .seospiderconfig file via File > Configuration > Save As. Build a recipe once, save it, and load it the next time you need that exact diagnostic. The real leverage comes from three areas most people skip: Custom Extraction (Configuration > Custom > Extraction), list mode (Mode > List), and tightening the spider config so you crawl only what answers your question. Pair these and you stop drowning in 50,000 URLs to confirm one hypothesis.

Custom extraction recipes

Custom extraction pulls anything from the rendered or raw HTML using CSSPath, XPath, or regex. Set the extractor type in the dropdown next to each field. These are the ones worth keeping permanently.

1. Pull structured data type and validate at scale

Use XPath to grab each JSON-LD block so you can see which template emits which schema. XPath returns the whole block as text, not the @type value on its own, so the type check happens on the export:

  • Extractor: XPath
  • Expression: //script[@type='application/ld+json']/text()
  • Set the dropdown to Extract Text. Then filter the export for pages whose extracted JSON-LD lacks an @type of Article, Product, or FAQPage where you expect it.
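Because the extractor returns raw JSON-LD text, the @type values still have to be pulled out of the export. A minimal sketch of that step, assuming each extracted cell holds one JSON-LD block as a string; the function name and sample markup are hypothetical:

```python
import json

def schema_types(jsonld_text):
    """Return the sorted set of @type values found in one JSON-LD string."""
    try:
        data = json.loads(jsonld_text)
    except (TypeError, ValueError):
        return []  # blank cell or malformed JSON-LD

    found = set()

    def walk(node):
        # Recurse through nested objects and arrays, including @graph.
        if isinstance(node, dict):
            node_type = node.get("@type")
            if isinstance(node_type, str):
                found.add(node_type)
            elif isinstance(node_type, list):
                found.update(t for t in node_type if isinstance(t, str))
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(data)
    return sorted(found)

# Hypothetical extracted cell.
sample = (
    '{"@context": "https://schema.org", "@graph": ['
    '{"@type": "Article", "author": {"@type": "Person"}}, '
    '{"@type": ["FAQPage", "WebPage"]}]}'
)
print(schema_types(sample))
```

An empty list for a URL means the cell was blank or the JSON-LD failed to parse, and both cases are worth a manual look.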

2. Detect canonical mismatches your CMS hides

Capture the canonical URL into its own column so you can compare it against the address with a spreadsheet formula rather than eyeballing the Canonicals tab:

  • Extractor: XPath
  • Expression: //link[@rel='canonical']/@href
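The comparison can also be scripted instead of done with a spreadsheet formula. A rough sketch, assuming a CSV export with an "Address" column and a custom extraction column named "Canonical"; both header names are assumptions, so rename them to match your export:

```python
import csv
import io

# Hypothetical export contents; in practice this would be read from the CSV file.
EXPORT = """Address,Canonical
https://example.com/a,https://example.com/a
https://example.com/b?ref=nav,https://example.com/b
https://example.com/c,
"""

def canonical_mismatches(rows, address_col="Address", canonical_col="Canonical"):
    """Return (address, canonical) pairs where the canonical is blank or differs."""
    mismatches = []
    for row in rows:
        address = (row.get(address_col) or "").strip()
        canonical = (row.get(canonical_col) or "").strip()
        if canonical != address:
            mismatches.append((address, canonical or "(missing)"))
    return mismatches

rows = list(csv.DictReader(io.StringIO(EXPORT)))
for address, canonical in canonical_mismatches(rows):
    print(address, "->", canonical)
```

The comparison is deliberately exact, so trailing-slash and protocol differences show up as mismatches rather than being normalized away.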

3. Find pages missing an analytics or tag manager snippet

Confirm the GA4 or GTM snippet is present site-wide with a regex extractor, which is faster than checking a sample by hand:

  • Extractor: Regex
  • Expression: (G-[A-Z0-9]{8,}|GTM-[A-Z0-9]+)
  • Any blank cell in the export is a page where the container is absent. Critical after a template migration.
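It can help to test the expression against a few page sources before committing to a full crawl. A small sketch using the same pattern; the URLs and IDs below are made up for illustration:

```python
import re

# Same expression as the extractor; the single capture group is what gets returned.
TAG_ID = re.compile(r"(G-[A-Z0-9]{8,}|GTM-[A-Z0-9]+)")

def tag_ids(html):
    """Return the unique GA4/GTM IDs found in a page's source, sorted."""
    return sorted(set(TAG_ID.findall(html)))

# Hypothetical page sources.
pages = {
    "https://example.com/": (
        '<script src="https://www.googletagmanager.com/gtag/js?id=G-ABCD123456"></script>'
    ),
    "https://example.com/pricing": (
        "<script>(function(w,d,s,l,i){})"
        "(window,document,'script','dataLayer','GTM-AB12CD3');</script>"
    ),
    "https://example.com/legacy": "<html><body>No tags here</body></html>",
}

# Pages with no ID correspond to blank cells in the export.
missing = [url for url, html in pages.items() if not tag_ids(html)]
print(missing)
```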

4. Extract word count and thin-content signals

Target the main content container directly instead of trusting whole-page word count, which includes nav and footer boilerplate:

  • Extractor: CSSPath
  • Expression: article or main (adjust to your template), Extract Text
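  • Count words on the extracted text in Sheets, for example with =LEN(TRIM(A2))-LEN(SUBSTITUTE(TRIM(A2)," ",""))+1 where A2 holds the extracted text, or sort by the extracted string length, to surface genuinely thin body copy.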
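The same thin-content check can be run on the export outside a spreadsheet. A minimal sketch, assuming the extracted body text has been loaded into a dict keyed by URL; the 300-word threshold is an arbitrary placeholder, not a recommendation:

```python
def thin_pages(extracted, min_words=300):
    """Return (url, word_count) pairs below min_words, thinnest first."""
    counts = [(url, len(text.split())) for url, text in extracted.items()]
    return sorted((pair for pair in counts if pair[1] < min_words), key=lambda pair: pair[1])

# Hypothetical extracted body text per URL.
sample = {
    "https://example.com/guide": "word " * 850,
    "https://example.com/tag/misc": "Posts tagged misc",
    "https://example.com/empty": "",
}
print(thin_pages(sample))
```

A count of zero usually means the selector missed the template rather than that the page is empty, so check those URLs before flagging them.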

5. Scrape prices, ratings, or stock status for ecommerce diffs

Point CSSPath at the price node (span.price, .product-price) and the review count. Run the same recipe weekly and diff the exports to catch unintended price or availability changes that tank rich results.
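The weekly diff itself is straightforward to script. A sketch under the assumption that each export has been reduced to a URL-to-value mapping (price here); the URLs and values are invented:

```python
def diff_exports(previous, current):
    """Compare two {url: value} snapshots and return {url: (before, after)} for changes."""
    changes = {}
    for url in sorted(previous.keys() | current.keys()):
        before = previous.get(url)  # None means the URL was absent last time
        after = current.get(url)    # None means the URL is absent now
        if before != after:
            changes[url] = (before, after)
    return changes

# Hypothetical snapshots from two weekly crawls.
last_week = {
    "https://example.com/p/1": "19.99",
    "https://example.com/p/2": "49.00",
    "https://example.com/p/3": "5.00",
}
this_week = {
    "https://example.com/p/1": "19.99",
    "https://example.com/p/2": "4.90",
    "https://example.com/p/4": "12.00",
}

for url, (before, after) in diff_exports(last_week, this_week).items():
    print(url, before, "->", after)
```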

List mode recipes

Switch to Mode > List to crawl a fixed set of URLs instead of following links. This is where Screaming Frog becomes a verification tool rather than a discovery tool.

  1. Audit only your indexable money pages. Export your priority URLs (from GSC, a sitemap, or a ranking export), paste them into list mode, and check status code, canonical, indexability, and title in one pass, without crawling the entire site.
  2. Validate a redirect map before launch. In list mode, enable Configuration > Spider > Advanced > Always Follow Redirects, paste your old URLs, and use Reports > Redirects > Redirect Chains to confirm every legacy URL lands on a 200 in a single hop. Catch 302s, chains, and loops before they cost you equity.
  3. Crawl a sitemap to find orphans and errors. In list mode, use Upload > Download XML Sitemap (or Download XML Sitemap Index for an index file). Cross-reference against a full crawl: URLs in the sitemap but not in the crawl are orphaned; URLs in your sitemap that return non-200s are actively asking Google to index broken pages.
  4. Spot-check GSC-flagged URLs. When Search Console reports "Crawled, currently not indexed" or soft 404s, paste that exact list and inspect indexability, canonical target, and response code together to diagnose the pattern fast.
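The sitemap cross-reference in recipe 3 boils down to two set operations. A minimal sketch, assuming the list mode crawl of the sitemap has been reduced to a URL-to-status-code mapping and the full crawl to a set of URLs; all names and data are hypothetical:

```python
def audit_sitemap(sitemap_status, crawled_urls):
    """Return (orphans, broken) for a sitemap checked against a full crawl.

    sitemap_status: {url: status_code} from crawling the sitemap in list mode.
    crawled_urls: set of URLs discovered by following links in a full crawl.
    """
    orphans = sorted(url for url in sitemap_status if url not in crawled_urls)
    broken = sorted(url for url, code in sitemap_status.items() if code != 200)
    return orphans, broken

# Hypothetical inputs.
sitemap_status = {
    "https://example.com/": 200,
    "https://example.com/old-offer": 404,
    "https://example.com/landing/hidden": 200,
}
crawled_urls = {"https://example.com/", "https://example.com/old-offer"}

orphans, broken = audit_sitemap(sitemap_status, crawled_urls)
print("Orphaned:", orphans)
print("Non-200 in sitemap:", broken)
```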

Configuration recipes that sharpen every crawl

The spider configuration is where you trade noise for signal. Save these as named configs.

The "JS-rendered audit" config

  • Configuration > Spider > Rendering > JavaScript. Set AJAX timeout to at least 5 seconds for heavy SPAs.
  • Compare rendered vs. raw HTML using the JavaScript tab filters: "Contains JavaScript Links" and "Pages with Blocked Resources" reveal content Google may never see.

The "log-light, fast structural" config

  • Disable image, CSS, JS, and SWF crawling under Spider > Crawl when you only care about HTML architecture. This dramatically cuts crawl time and memory.
  • Turn off Store HTML / Rendered HTML under Spider > Extraction if you don't need it. On large sites this saves a lot of memory.

The "include/exclude scalpel" config

  • Configuration > Include with a regex like https://example.com/blog/.* crawls only one section.
  • Configuration > Exclude with .*\?.* drops every parameterized URL when faceted navigation is flooding your crawl. The question mark must be escaped; unescaped, .*?.* matches every URL. Combine with Remove Parameters for canonicalization testing.
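Include and exclude patterns are worth sanity-checking against sample URLs before a crawl. The sketch below assumes patterns are applied to the whole URL, approximated here with Python's re.fullmatch; the crawler's own regex engine may differ in details, and the URLs are hypothetical. It also shows why the question mark needs escaping: unescaped, it only makes the preceding .* lazy, so the pattern matches everything.

```python
import re

def url_matches(pattern, url):
    # Assumption: the pattern must match the entire URL, not just part of it.
    return re.fullmatch(pattern, url) is not None

urls = [
    "https://example.com/blog/crawl-recipes",
    "https://example.com/shop/shoes?color=red&size=9",
    "https://example.com/shop/shoes",
]

include_blog = r"https://example.com/blog/.*"
exclude_params = r".*\?.*"     # literal question mark: parameterized URLs only
exclude_unescaped = r".*?.*"   # "?" acts as a lazy modifier: matches every URL

for url in urls:
    print(
        url,
        "include:", url_matches(include_blog, url),
        "exclude:", url_matches(exclude_params, url),
        "unescaped:", url_matches(exclude_unescaped, url),
    )
```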

The "crawl as Googlebot" config

  • Set Configuration > User-Agent to Googlebot Smartphone and respect or ignore robots.txt deliberately (Spider > Robots.txt) to see exactly what the mobile crawler reaches, and what's being blocked.

Connect APIs for one-pass prioritization

Under Configuration > API Access, plug in Google Search Console, GA4, and PageSpeed Insights. Now a single crawl returns each URL alongside its clicks, impressions, sessions, and Core Web Vitals. Sort by impressions descending, filter to non-indexable or slow pages, and you have a ranked fix list backed by real traffic data instead of crawl-depth guesses.
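The sort-and-filter step can be sketched in a few lines once the export is loaded. The field names below (indexability, impressions, lcp_ms) are hypothetical stand-ins for the export's columns, and the 2,500 ms cutoff is the commonly cited "good" Largest Contentful Paint threshold:

```python
def fix_list(rows, slow_lcp_ms=2500, top=10):
    """Return non-indexable or slow pages, highest impressions first."""
    flagged = [
        row for row in rows
        if row["indexability"] != "Indexable" or row["lcp_ms"] > slow_lcp_ms
    ]
    return sorted(flagged, key=lambda row: row["impressions"], reverse=True)[:top]

# Hypothetical rows from a crawl with API data attached.
rows = [
    {"url": "/a", "indexability": "Indexable", "impressions": 1200, "lcp_ms": 1800},
    {"url": "/b", "indexability": "Non-Indexable", "impressions": 900, "lcp_ms": 1500},
    {"url": "/c", "indexability": "Indexable", "impressions": 5000, "lcp_ms": 4100},
    {"url": "/d", "indexability": "Non-Indexable", "impressions": 50, "lcp_ms": 3000},
]

for row in fix_list(rows):
    print(row["url"], row["impressions"], row["indexability"], row["lcp_ms"])
```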

Common mistakes

  • Crawling in default (memory) storage on large sites. Switch to Database Storage Mode (Configuration > System > Storage Mode) for anything over ~150k URLs, or the crawl can exhaust available RAM and stall.
  • Trusting raw HTML on a JS site. If the framework renders client-side, enable JavaScript rendering or your content and link audits are fiction.
  • Forgetting that list mode ignores robots.txt by default in some setups. Verify your robots configuration matches your intent before drawing conclusions.
  • Extracting from the wrong DOM state. CSSPath and XPath run against rendered HTML only when rendering is on; otherwise they hit raw source. Match the extractor mode to the rendering mode.

Build your library

Save each of these as a named .seospiderconfig file in a shared folder so your whole team loads the same scalpel for the same job. The goal isn't to crawl more; it's to ask one precise question per crawl and get an exportable answer in minutes. That discipline is what separates a crawl report from a diagnosis.


Claude Vincent is a technical SEO consultant focused on crawlability, rendering, and AI-search visibility. He writes the field guides and case studies at SEO ProCheck, with a bias toward the durable, unglamorous work that decides whether search engines and AI answer engines can actually read and cite a site.
