Question 1

What is a robots.txt file?

Accepted Answer

A text file placed at your domain root (example.com/robots.txt) that tells search engine crawlers which URLs they can access. It uses the Robots Exclusion Protocol. Important: it's a request, not enforcement. Good bots respect it; malicious bots ignore it.

Question 2

Where should robots.txt be located?

Accepted Answer

Always at the domain root: https://example.com/robots.txt. Subdirectories don't work. Each subdomain needs its own file. The file must return an HTTP 200 status code for crawlers to read and apply its rules.
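To illustrate the "root of each host" rule, here is a minimal Python sketch that derives the robots.txt URL governing any page URL. The function name `robots_txt_url` is made up for this example; it relies only on the standard library's `urllib.parse`.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url (same scheme and host)."""
    parts = urlsplit(page_url)
    # Keep scheme and host (including any subdomain or port);
    # drop the path, query string, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://example.com/blog/post?id=7"))
print(robots_txt_url("https://blog.example.com/2024/"))
```

The two calls produce different URLs because a subdomain counts as a separate host with its own file.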

Question 3

Is robots.txt required for SEO?

Accepted Answer

No, it's optional. Without one, crawlers assume full access. However, having one helps control crawl budget, block non-essential pages, and declare sitemap location. A missing file returns 404, which means "crawl everything."

Question 4

Does robots.txt provide security?

Accepted Answer

No. The file is publicly readable at /robots.txt. Anyone can view what you're blocking. For actual security, use password protection or authentication. Never use robots.txt to hide sensitive content from bad actors.

Question 5

Does robots.txt prevent indexing?

Accepted Answer

No. Blocking crawling differs from blocking indexing. Google may still index blocked URLs found via links, showing "No information available." Use a noindex meta tag or X-Robots-Tag header to actually prevent indexing, and leave the page crawlable so bots can see that signal.
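One way to send the X-Robots-Tag header is from the application itself. The sketch below wraps a WSGI app so that responses under chosen path prefixes carry `X-Robots-Tag: noindex`. The names `noindex_middleware`, `demo_app`, and the `/drafts/` prefix are assumptions for illustration, not part of any framework. Most servers and CMSs offer their own way to set this header.

```python
def noindex_middleware(app, prefixes):
    """Wrap a WSGI app; add 'X-Robots-Tag: noindex' for paths under the given prefixes."""
    def wrapped(environ, start_response):
        path = environ.get("PATH_INFO", "/")

        def patched_start(status, headers, exc_info=None):
            if any(path.startswith(prefix) for prefix in prefixes):
                # Copy the header list instead of mutating the app's own list.
                headers = list(headers) + [("X-Robots-Tag", "noindex")]
            return start_response(status, headers, exc_info)

        return app(environ, patched_start)
    return wrapped


def demo_app(environ, start_response):
    """Hypothetical application that always answers 200 with plain text."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


def response_headers(app, path):
    """Call the app for one path and return the headers it sent."""
    captured = {}

    def fake_start(status, headers, exc_info=None):
        captured["headers"] = headers

    app({"PATH_INFO": path}, fake_start)
    return captured["headers"]


protected = noindex_middleware(demo_app, ["/drafts/"])
print(response_headers(protected, "/drafts/post"))
print(response_headers(protected, "/blog/post"))
```

The paths that receive the header must not be disallowed in robots.txt, or crawlers will never request them and never see it.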

Question 6

What's the difference between crawling and indexing?

Accepted Answer

Crawling means bots access and read your page. Indexing means search engines add pages to their database. Robots.txt controls crawling only. If you block crawling, bots can't see noindex tags on the page.

Question 7

How do I create a robots.txt file?

Accepted Answer

Create a plain text file named exactly "robots.txt" (lowercase). Add your directives, one per line. Upload to your domain root via FTP or CMS. Verify by visiting yourdomain.com/robots.txt in a browser.

Question 8

What does User-agent mean?

Accepted Answer

User-agent specifies which crawler the rules apply to. "User-agent: *" targets all bots. "User-agent: Googlebot" targets only Google. You can have multiple User-agent blocks with different rules for each crawler.

Question 9

What does Disallow mean?

Accepted Answer

Disallow tells crawlers not to access matching URLs. "Disallow: /admin/" blocks the admin directory. "Disallow: /" blocks everything. "Disallow:" with empty value allows everything. Paths are case-sensitive.
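A quick way to see these Disallow cases in action is Python's built-in `urllib.robotparser`. This is a sketch: the helper name `allowed` and the agent name `ExampleBot` are made up, and Python's parser is a simple implementation whose results can differ from Googlebot's on more complex files.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


block_admin = "User-agent: *\nDisallow: /admin/\n"
block_all = "User-agent: *\nDisallow: /\n"
allow_all = "User-agent: *\nDisallow:\n"

# Directory rule: blocks /admin/..., but paths are case-sensitive.
print(allowed(block_admin, "ExampleBot", "https://example.com/admin/login"))
print(allowed(block_admin, "ExampleBot", "https://example.com/Admin/login"))

# "Disallow: /" blocks everything; an empty Disallow blocks nothing.
print(allowed(block_all, "ExampleBot", "https://example.com/any/page"))
print(allowed(allow_all, "ExampleBot", "https://example.com/any/page"))
```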

Question 10

What does Allow mean?

Accepted Answer

Allow explicitly permits crawling of URLs otherwise blocked by Disallow. Useful for exceptions: block /private/ but Allow: /private/public-page.html. Googlebot and Bingbot support Allow; some crawlers don't.

Question 11

What is the Sitemap directive?

Accepted Answer

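Tells crawlers your XML sitemap location. Format: Sitemap: https://example.com/sitemap.xml. The URL must be absolute. By convention it goes at the bottom of the file, but the directive is independent of User-agent blocks, so its position doesn't matter. You can list multiple sitemaps. Helps discovery even without Search Console submission.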

Question 12

What is Crawl-delay?

Accepted Answer

Requests that bots wait X seconds between requests. "Crawl-delay: 10" means 10-second gaps. Googlebot ignores this directive entirely; Google adjusts its crawl rate automatically, and the old Search Console crawl-rate setting has been retired. Bingbot respects it. Excessive delays hurt indexation.
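If you write your own crawler, you can read the requested delay with Python's `urllib.robotparser` (the `crawl_delay` method exists in Python 3.6 and later). A small sketch, where the robots.txt content and the agent name `ExampleBot` are illustrative:

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Bingbot
Crawl-delay: 10
Disallow: /search

User-agent: *
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Bingbot's block sets a delay; the catch-all block does not.
bing_delay = parser.crawl_delay("Bingbot")
other_delay = parser.crawl_delay("ExampleBot")
print(bing_delay, other_delay)
```

A polite crawler would sleep for that many seconds between requests and choose its own default when the value is `None`.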

Question 13

Can I use wildcards in robots.txt?

Accepted Answer

Googlebot and Bingbot support * (matches any characters) and $ (matches URL end). Example: "Disallow: /*.pdf$" blocks all PDFs. "Disallow: /*?sort=" blocks sort parameters. Basic crawlers only understand literal paths.
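To make the wildcard semantics concrete, here is a minimal sketch that translates a rule path into a regular expression. It is an approximation of the matching described above, not Google's actual implementation, and the function names are hypothetical.

```python
import re


def rule_to_regex(rule: str) -> "re.Pattern[str]":
    """Translate a robots.txt path rule into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end of the URL.
    Everything else is matched literally, starting at the beginning of the path.
    """
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    regex = ".*".join(re.escape(piece) for piece in body.split("*"))
    if anchored:
        regex += "$"
    return re.compile(regex)


def rule_matches(rule: str, path: str) -> bool:
    """True if the rule applies to the path (path includes any query string)."""
    return rule_to_regex(rule).match(path) is not None


print(rule_matches("/*.pdf$", "/docs/guide.pdf"))             # PDF at URL end
print(rule_matches("/*.pdf$", "/docs/guide.pdf?download=1"))  # '$' rejects the suffix
print(rule_matches("/*?sort=", "/shop?sort=price"))           # sort is the first parameter
print(rule_matches("/*?sort=", "/shop?page=2&sort=price"))    # sort follows '&', not '?'
```

The last case shows why parameter rules often need both a `?` and an `&` variant.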

Question 14

Is robots.txt case-sensitive?

Accepted Answer

Directive names (User-agent, Disallow) are case-insensitive. URL paths are case-sensitive. "/Admin/" and "/admin/" are different. Always match your actual URL casing exactly when writing rules.

Question 15

How do I add comments?

Accepted Answer

Use # for comments. Everything after # on a line is ignored. Example: "# Block admin section" before the rule. Comments document your rules for future reference and help team members understand intent.

Question 16

How do I block a directory?

Accepted Answer

Use: Disallow: /directory-name/ with trailing slash. This blocks the directory and all contents. Without trailing slash, "/private" also blocks "/private-info.html" which you may not intend.

Question 17

How do I block a single page?

Accepted Answer

Use the exact path: Disallow: /page-to-block.html. For dynamic URLs, include parameters: Disallow: /page?id=123. Be precise to avoid accidentally blocking other pages with similar paths.

Question 18

How do I block URL parameters?

Accepted Answer

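Use wildcards: "Disallow: /*?sort=" blocks URLs where sort is the first parameter. "Disallow: /*&sessionid=" blocks session IDs that follow another parameter; add "Disallow: /*?sessionid=" to cover URLs where it comes first. For complex cases, use canonical tags instead. Search Console's URL Parameters tool was retired in 2022.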

Question 19

How do I block specific file types?

Accepted Answer

Use wildcard with $: "Disallow: /*.pdf$" blocks PDFs. "Disallow: /*.doc$" blocks Word files. The $ ensures matching file extension at URL end, not just containing that string mid-path.

Question 20

Should I block internal search results?

Accepted Answer

Generally yes. Internal search creates infinite URL variations, wastes crawl budget, and provides little SEO value. Use "Disallow: /search", or "Disallow: /?s=" for WordPress. Consider noindex as an alternative approach.

Question 21

How do I block a staging site?

Accepted Answer

On staging, use: User-agent: * and Disallow: /. This blocks all compliant crawlers from all pages. A noindex meta tag is not a reliable backup here, because blocked crawlers never fetch the page to see it. Better approach: password-protect staging or restrict access by IP.

Question 22

How do I allow one page in a blocked directory?

Accepted Answer

Combine Disallow and Allow. Example: Disallow: /members/ then Allow: /members/join.html. Googlebot uses the most specific match. The /members/join.html page remains accessible while directory is blocked.
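The "most specific match" idea can be sketched in a few lines of Python. This is a simplified model, assuming literal path prefixes only (no wildcards), with the longest matching rule winning and Allow winning a tie. The function name `is_allowed` is made up for this example.

```python
def is_allowed(rules, path):
    """Decide crawlability using longest-match precedence.

    rules is a list of (directive, pattern) pairs, e.g. ("Disallow", "/members/").
    """
    best_len = -1
    best_allow = True  # no matching rule means the path is crawlable
    for directive, pattern in rules:
        if not pattern or not path.startswith(pattern):
            continue
        allow = directive.lower() == "allow"
        # A longer pattern is more specific; on equal length, prefer Allow.
        if len(pattern) > best_len or (len(pattern) == best_len and allow):
            best_len, best_allow = len(pattern), allow
    return best_allow


rules = [
    ("Disallow", "/members/"),
    ("Allow", "/members/join.html"),
]

print(is_allowed(rules, "/members/join.html"))  # the Allow rule is longer
print(is_allowed(rules, "/members/profile"))    # only the Disallow rule matches
print(is_allowed(rules, "/about"))              # no rule matches
```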

Question 23

What is the Googlebot user-agent?

Accepted Answer

Googlebot is Google's main web crawler. Variations include Googlebot-Image, Googlebot-News, Googlebot-Video. Target all with "User-agent: Googlebot" or specify variants. Variant rules override general Googlebot rules.

Question 24

What is Bingbot?

Accepted Answer

Microsoft's crawler for Bing search. Also powers Yahoo and DuckDuckGo partially. Target with "User-agent: Bingbot". Respects Crawl-delay unlike Googlebot. Important for AI visibility since some AI uses Bing's index.

Question 25

How do I block AI crawlers?

Accepted Answer

Add separate User-agent blocks: GPTBot (OpenAI training), ChatGPT-User (browsing), OAI-SearchBot (SearchGPT), ClaudeBot (Anthropic), Google-Extended (Gemini training), PerplexityBot. Each needs its own Disallow: / rule.
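Because each crawler gets the same two-line block, the rules are easy to generate. A small Python sketch, using the crawler names listed above (the function name `build_block_rules` is hypothetical):

```python
AI_CRAWLERS = [
    "GPTBot",
    "ChatGPT-User",
    "OAI-SearchBot",
    "ClaudeBot",
    "Google-Extended",
    "PerplexityBot",
]


def build_block_rules(user_agents):
    """Return robots.txt text with one 'Disallow: /' block per user agent."""
    blocks = [f"User-agent: {agent}\nDisallow: /" for agent in user_agents]
    # Separate blocks with a blank line, as robots.txt files conventionally do.
    return "\n\n".join(blocks) + "\n"


print(build_block_rules(AI_CRAWLERS))
```

Paste the output into your existing robots.txt alongside your other rules. User-agent names change over time, so check each vendor's current documentation.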

Question 26

What is GPTBot?

Accepted Answer

OpenAI's crawler for AI training data. Blocking GPTBot prevents use in future model training. Does NOT affect real-time ChatGPT browsing (that's ChatGPT-User). Block with: User-agent: GPTBot then Disallow: /.

Question 27

What is Google-Extended?

Accepted Answer

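A robots.txt control token for Gemini (formerly Bard) AI training, not a separate crawler: Google's existing crawlers fetch the pages, and the Google-Extended rules decide whether that content may be used for training. Blocking prevents AI training use but does NOT affect Search rankings or AI Overviews. Those use regular Googlebot. Safe to block for AI training opt-out only.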

Question 28

Can I set different rules for different crawlers?

Accepted Answer

Yes. Create separate User-agent blocks. Allow Googlebot everywhere but block GPTBot entirely. Each crawler follows its own User-agent section rules, or falls back to User-agent: * if no specific match exists.
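Here is a sketch of per-crawler rules checked with Python's `urllib.robotparser`. The file content and the agent name `ExampleBot` are illustrative.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

checks = {
    # Googlebot has its own block, so the catch-all /private/ rule does not apply.
    "googlebot_private": parser.can_fetch("Googlebot", "https://example.com/private/report"),
    # GPTBot is blocked everywhere.
    "gptbot_home": parser.can_fetch("GPTBot", "https://example.com/"),
    # Any other bot falls back to the User-agent: * block.
    "other_private": parser.can_fetch("ExampleBot", "https://example.com/private/report"),
    "other_about": parser.can_fetch("ExampleBot", "https://example.com/about"),
}
print(checks)
```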

Question 29

How do I test my robots.txt?

Accepted Answer

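Use the robots.txt report in Google Search Console (under Settings), which replaced the older robots.txt Tester; it shows the files Google fetched and any parsing problems. To check a specific URL, use the URL Inspection tool. Also test by visiting /robots.txt directly, or use Screaming Frog's validator or online robots.txt checkers.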

Question 30

How do I check if a URL is blocked?

Accepted Answer

In Search Console, use the URL Inspection tool; it shows "Blocked by robots.txt" if applicable. To find which specific rule blocks a URL, test it in a robots.txt checker or validator. Screaming Frog flags blocked URLs during crawls.

Question 31

Why isn't my robots.txt working?

Accepted Answer

Common causes: wrong file location, syntax errors, file returns error status, caching delays (crawlers cache 24+ hours), wrong user-agent name, case mismatch in paths, or expecting it to block indexing.

Question 32

How often do crawlers refresh robots.txt?

Accepted Answer

Googlebot caches robots.txt for up to 24 hours, sometimes longer. Changes don't take effect immediately. You can request a recrawl from Search Console's robots.txt report, but there is no guaranteed instant refresh. For urgent changes, implement server-side blocks while waiting for the cache to expire.

Question 33

What happens with syntax errors?

Accepted Answer

Crawlers may ignore malformed rules or the entire file. Common errors: spaces before directives, missing colons, typos in directive names. Always validate syntax using Search Console or validators before deploying.

Question 34

Should I block CSS and JavaScript?

Accepted Answer

No. Google needs CSS/JS to render pages correctly. Blocking prevents proper rendering and hurts rankings. This outdated advice no longer applies. Allow Googlebot access to all resources needed for rendering.

Question 35

Should I block images?

Accepted Answer

Generally no. Blocked images won't appear in Google Images and may affect page understanding. Only block for specific reasons like bandwidth or truly private images needing protection from all crawlers.

Question 36

Can I put noindex in robots.txt?

Accepted Answer

No. Google deprecated the "Noindex:" directive in robots.txt in 2019. It's completely ignored now. Use noindex meta tag in HTML or X-Robots-Tag HTTP header instead for index control.

Question 37

Can I block crawling AND use noindex?

Accepted Answer

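Not together effectively. If crawling is blocked, bots never request the page, so they see neither a noindex meta tag nor an X-Robots-Tag HTTP header. To remove a page from the index, allow crawling and serve noindex (via the meta tag, or via X-Robots-Tag for non-HTML files such as PDFs). Add the Disallow rule only after the page has dropped out of the index.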

Question 38

Does the trailing slash matter?

Accepted Answer

Yes. "Disallow: /folder" blocks /folder, /folder/, and /folder-name.html. "Disallow: /folder/" only blocks /folder/ directory contents. Use trailing slash for directories to be precise about scope.
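The difference comes from plain prefix matching, which a short Python sketch can show. It covers literal prefixes only (wildcards are not handled), and the function name `blocked_by` is made up.

```python
def blocked_by(rule_path: str, url_path: str) -> bool:
    """Literal prefix match: the basic behavior of a Disallow rule."""
    return url_path.startswith(rule_path)


paths = ["/folder", "/folder/", "/folder/page.html", "/folder-name.html"]

without_slash = [p for p in paths if blocked_by("/folder", p)]
with_slash = [p for p in paths if blocked_by("/folder/", p)]

print(without_slash)  # all four paths start with "/folder"
print(with_slash)     # only "/folder/" and "/folder/page.html"
```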

Question 39

What if I accidentally block everything?

Accepted Answer

"Disallow: /" under "User-agent: *" blocks all crawlers from all pages. Your entire site disappears from search over time. Always test robots.txt after editing. Verify critical pages with URL Inspection.

Question 40

Can I list multiple sitemaps?

Accepted Answer

Yes. Add multiple Sitemap directives, each on its own line. Example: Sitemap: https://example.com/sitemap-posts.xml then Sitemap: https://example.com/sitemap-pages.xml. All will be discovered by crawlers.
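To confirm that a parser picks up every Sitemap line, you can use `site_maps()` from Python's `urllib.robotparser` (available in Python 3.8 and later). A brief sketch with illustrative file content:

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow:

Sitemap: https://example.com/sitemap-posts.xml
Sitemap: https://example.com/sitemap-pages.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns the listed URLs in file order, or None if there are none.
sitemaps = parser.site_maps()
print(sitemaps)
```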

Question 41

What is the Host directive?

Accepted Answer

A Yandex-specific directive indicating preferred domain version. Example: Host: www.example.com. Google ignores it; use canonical tags and Search Console settings instead for preferred domain specification.

Question 42

What is the Clean-param directive?

Accepted Answer

Yandex-specific directive for handling URL parameters. Tells Yandex which parameters don't affect page content. Google ignores it; use canonical tags instead, since Search Console's URL Parameters tool was retired in 2022.

Question 43

Do subdomains share robots.txt?

Accepted Answer

No. Each subdomain needs its own robots.txt file. blog.example.com/robots.txt is completely separate from example.com/robots.txt. Rules don't cascade or inherit across subdomains.

Question 44

Do HTTP and HTTPS need separate files?

Accepted Answer

Technically yes, they're different protocols. However, you should redirect HTTP to HTTPS anyway. Maintain robots.txt on your canonical HTTPS version. The redirect handles HTTP requests.

Question 45

Does my CDN need robots.txt?

Accepted Answer

If your CDN uses a different domain (cdn.example.com), it may need its own robots.txt. For same-domain CDN setups, your main robots.txt applies. Check if CDN assets are being crawled unexpectedly.

Question 46

Is there a size limit?

Accepted Answer

Google limits processing to 500 KiB. Content beyond that limit is ignored. Most sites never approach this limit. If yours does, consolidate rules using wildcards or reconsider what truly needs blocking.

Question 47

Can I use non-ASCII characters?

Accepted Answer

Robots.txt should use UTF-8 encoding for international characters in paths. Most crawlers handle UTF-8 correctly. Avoid special characters in directive names; those must be ASCII only.

Question 48

Does rule order matter?

Accepted Answer

For Googlebot, specificity wins over order. More specific paths take precedence. For other crawlers, order may matter; place Allow before Disallow for safety. Test your specific use case.
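Python's own `urllib.robotparser` is an example of a parser where order matters: in the CPython versions I am aware of, it applies the first rule that matches rather than the most specific one. The sketch below runs the same two rules in both orders. The helper name `can_fetch` and the agent name `ExampleBot` are assumptions.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(robots_txt: str, url: str, agent: str = "ExampleBot") -> bool:
    """Parse robots.txt text and ask Python's parser about one URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


DISALLOW_FIRST = "User-agent: *\nDisallow: /members/\nAllow: /members/join.html\n"
ALLOW_FIRST = "User-agent: *\nAllow: /members/join.html\nDisallow: /members/\n"
URL = "https://example.com/members/join.html"

# Same rules, different order: a first-match parser gives different answers.
disallow_first_result = can_fetch(DISALLOW_FIRST, URL)
allow_first_result = can_fetch(ALLOW_FIRST, URL)
print(disallow_first_result, allow_first_result)
```

Putting Allow lines first gives the intended result under both first-match and most-specific-match parsers, which is why it is the safer layout.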

Question 49

How do I edit robots.txt in WordPress?

Accepted Answer

WordPress generates virtual robots.txt by default. Edit via Yoast SEO or Rank Math settings, or create a physical file in your root directory. Physical file overrides virtual generation.

Question 50

How do I edit robots.txt in Shopify?

Accepted Answer

Shopify allows robots.txt customization through the robots.txt.liquid theme template. Access via Online Store > Themes > Edit Code, then add or edit the robots.txt.liquid template. Shopify has default rules you can modify.

Question 51

How do I verify bots are respecting robots.txt?

Accepted Answer

Check server log files for bot activity. Look for crawl attempts on blocked paths; compliant bots won't request them. Non-compliant or malicious bots may ignore rules and crawl anyway.
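A log check like this can be scripted. The sketch below assumes access logs in the common "combined" format and uses two invented sample lines; the regex, function name, and bot names are all hypothetical, so adapt them to your server's actual log format.

```python
import re
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""

# Invented sample lines in combined log format (illustration only).
LOG_LINES = [
    '203.0.113.5 - - [01/Jan/2025:10:00:00 +0000] "GET /admin/login HTTP/1.1" 200 512 "-" "BadBot/1.0"',
    '203.0.113.9 - - [01/Jan/2025:10:00:05 +0000] "GET /blog/post HTTP/1.1" 200 2048 "-" "GoodBot/2.1"',
]

# Capture the request path and the user-agent string from each line.
LINE_RE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def find_violations(log_lines, robots_txt):
    """Return (agent, path) pairs for requests that robots.txt disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    violations = []
    for line in log_lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # skip lines in an unexpected format
        agent, path = match["agent"], match["path"]
        if not parser.can_fetch(agent, path):
            violations.append((agent, path))
    return violations


violations = find_violations(LOG_LINES, ROBOTS_TXT)
print(violations)
```

User-agent strings can be spoofed, so treat the agent name in a log line as a claim rather than proof of which bot made the request.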

| Rule | Effect |
| --- | --- |
| Disallow: / | Block entire site |
| Disallow: | Allow entire site (empty value) |
| Disallow: /folder/ | Block directory and contents |
| Disallow: /*.pdf$ | Block all PDF files |
| Allow: /folder/page.html | Exception within blocked directory |

| Directive | Purpose | Example | Notes |
| --- | --- | --- | --- |
| `User-agent` | Names the crawler a rule block applies to | `User-agent: *` | Use `*` for all bots, or a name like Googlebot |
| `Disallow` | Blocks crawling of matching paths | `Disallow: /admin/` | An empty value allows everything |
| `Allow` | Permits a path inside a disallowed area | `Allow: /admin/public/` | The most specific match wins |
| `Sitemap` | Points crawlers to your XML sitemap | `Sitemap: https://example.com/sitemap.xml` | Must be an absolute URL, can repeat |
| `Crawl-delay` | Requests a delay in seconds between requests | `Crawl-delay: 10` | Bing respects it, Google ignores it |
| `$` and `*` | Pattern matching in paths | `Disallow: /*.pdf$` | `*` matches any sequence, `$` anchors the end |

| Crawler | Company | Purpose |
| --- | --- | --- |
| GPTBot | OpenAI | Model training |
| ChatGPT-User | OpenAI | Real-time browsing |
| OAI-SearchBot | OpenAI | SearchGPT index |
| ClaudeBot | Anthropic | Model training |
| Google-Extended | Google | Gemini training |
| PerplexityBot | Perplexity | Search index |

Robots.txt FAQ: Complete Guide to Crawler Directives

Table of Contents

Robots.txt Basics