Complete answers to every robots.txt question, from basic syntax to advanced crawler control. Each answer provides the essential information you need without fluff.
Table of Contents
- Robots.txt Basics
- Syntax and Directives
- Blocking and Allowing
- Specific Crawlers
- Testing and Validation
- Common Mistakes
- Advanced Usage
Robots.txt Basics
What is a robots.txt file?
A text file placed at your domain root (example.com/robots.txt) that tells search engine crawlers which URLs they can access. It uses the Robots Exclusion Protocol. Important: it's a request, not enforcement. Good bots respect it; malicious bots ignore it.
Where should robots.txt be located?
Always at the domain root: https://example.com/robots.txt. Subdirectories don't work. Each subdomain needs its own file. The file must return HTTP 200 status code to be valid and readable by crawlers.
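As a rough illustration of the root-only rule, the sketch below derives the robots.txt URL that governs any page URL. The helper name robots_url is made up for this example; it keeps the scheme and host and replaces everything else.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url (same scheme and host)."""
    parts = urlsplit(page_url)
    # Keep scheme and host; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

main_site = robots_url("https://example.com/shop/item?id=7")
blog_site = robots_url("https://blog.example.com/2024/post/")
print(main_site)  # https://example.com/robots.txt
print(blog_site)  # https://blog.example.com/robots.txt
```

The two results differ because the subdomain is part of the host, which is why each subdomain needs its own file.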
Is robots.txt required for SEO?
No, it's optional. Without one, crawlers assume full access. However, having one helps control crawl budget, block non-essential pages, and declare sitemap location. A missing file returns 404, which means "crawl everything."
Does robots.txt provide security?
No. The file is publicly readable at /robots.txt. Anyone can view what you're blocking. For actual security, use password protection or authentication. Never use robots.txt to hide sensitive content from bad actors.
Does robots.txt prevent indexing?
No. Blocking crawling differs from blocking indexing. Google may still index blocked URLs found via links, showing "No information available." Use noindex meta tag or X-Robots-Tag header to actually prevent indexing.
What's the difference between crawling and indexing?
Crawling means bots access and read your page. Indexing means search engines add pages to their database. Robots.txt controls crawling only. If you block crawling, bots can't see noindex tags on the page.
How do I create a robots.txt file?
Create a plain text file named exactly "robots.txt" (lowercase). Add your directives, one per line. Upload to your domain root via FTP or CMS. Verify by visiting yourdomain.com/robots.txt in a browser.
Syntax and Directives
What does User-agent mean?
User-agent specifies which crawler the rules apply to. "User-agent: *" targets all bots. "User-agent: Googlebot" targets only Google. You can have multiple User-agent blocks with different rules for each crawler.
What does Disallow mean?
Disallow tells crawlers not to access matching URLs. "Disallow: /admin/" blocks the admin directory. "Disallow: /" blocks everything. "Disallow:" with empty value allows everything. Paths are case-sensitive.
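One way to see a Disallow rule in action is Python's standard-library urllib.robotparser. The rules and URLs below are invented for illustration, and this parser is only one implementation; real crawlers may differ in details.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Paths are matched as case-sensitive prefixes.
blocked = parser.can_fetch("*", "https://example.com/admin/login")
open_page = parser.can_fetch("*", "https://example.com/blog/")
other_case = parser.can_fetch("*", "https://example.com/Admin/login")
print(blocked, open_page, other_case)  # False True True
```

The last check passes because "/Admin/" does not match the lowercase "/admin/" rule.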
What does Allow mean?
Allow explicitly permits crawling of URLs otherwise blocked by Disallow. Useful for exceptions: block /private/ but Allow: /private/public-page.html. Googlebot and Bingbot support Allow; some crawlers don't.
What is the Sitemap directive?
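Tells crawlers your XML sitemap location. Format: Sitemap: https://example.com/sitemap.xml. Use an absolute URL. The directive is independent of User-agent groups, so it can go anywhere in the file; the bottom is a common convention. You can list multiple sitemaps. Helps discovery even without Search Console submission.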
What is Crawl-delay?
Requests bots wait X seconds between requests. "Crawl-delay: 10" means 10-second gaps. Googlebot ignores this directive entirely; Google adjusts its crawl rate automatically, and the Search Console crawl rate limiter was retired in early 2024. Bingbot respects it. Excessive delays hurt indexation.
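If you write your own crawler, Python's urllib.robotparser can read the declared delay. This is a minimal sketch with made-up rules; "SomeOtherBot" is a hypothetical crawler name used to show the no-match case.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Bingbot
Crawl-delay: 10
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

bing_delay = parser.crawl_delay("Bingbot")        # delay declared in the Bingbot group
other_delay = parser.crawl_delay("SomeOtherBot")  # no matching group, so no delay
print(bing_delay, other_delay)  # 10 None
```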
Can I use wildcards in robots.txt?
Googlebot and Bingbot support * (matches any characters) and $ (matches URL end). Example: "Disallow: /*.pdf$" blocks all PDFs. "Disallow: /*?sort=" blocks sort parameters. Basic crawlers only understand literal paths.
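To make the wildcard behavior concrete, here is a small sketch that translates a robots.txt pattern into a regular expression. It is an approximation of how * and $ are described above, not any search engine's actual parser, and the function names are invented for this example.

```python
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the URL end.
    Everything else is matched literally, starting at the beginning of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces, then join them with '.*' for each wildcard.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def matches(pattern: str, path: str) -> bool:
    # re.match anchors at the start of the path, mirroring prefix matching.
    return pattern_to_regex(pattern).match(path) is not None

print(matches("/*.pdf$", "/docs/guide.pdf"))      # True
print(matches("/*.pdf$", "/docs/guide.pdf?v=2"))  # False: URL does not end in .pdf
print(matches("/*?sort=", "/shoes?sort=price"))   # True
```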
Is robots.txt case-sensitive?
Directive names (User-agent, Disallow) are case-insensitive. URL paths are case-sensitive. "/Admin/" and "/admin/" are different. Always match your actual URL casing exactly when writing rules.
How do I add comments?
Use # for comments. Everything after # on a line is ignored. Example: "# Block admin section" before the rule. Comments document your rules for future reference and help team members understand intent.
Blocking and Allowing
How do I block a directory?
Use: Disallow: /directory-name/ with trailing slash. This blocks the directory and all contents. Without trailing slash, "/private" also blocks "/private-info.html" which you may not intend.
How do I block a single page?
Use the exact path: Disallow: /page-to-block.html. For dynamic URLs, include parameters: Disallow: /page?id=123. Be precise to avoid accidentally blocking other pages with similar paths.
How do I block URL parameters?
Use wildcards: "Disallow: /*?sort=" blocks URLs with a sort parameter. "Disallow: /*&sessionid=" blocks session IDs. For complex cases, use canonical tags instead; Search Console's URL Parameters tool was retired in 2022.
How do I block specific file types?
Use wildcard with $: "Disallow: /*.pdf$" blocks PDFs. "Disallow: /*.doc$" blocks Word files. The $ ensures matching file extension at URL end, not just containing that string mid-path.
Should I block internal search results?
Generally yes. Internal search creates infinite URL variations, wastes crawl budget, provides little SEO value. Use "Disallow: /search" or "Disallow: /?s=" for WordPress. Consider noindex as alternative approach.
How do I block a staging site?
On staging, use: User-agent: * and Disallow: /. This blocks all crawlers from all pages. Also add noindex meta tags as backup. Better approach: password-protect staging or restrict access by IP.
How do I allow one page in a blocked directory?
Combine Disallow and Allow. Example: Disallow: /members/ then Allow: /members/join.html. Googlebot uses the most specific match. The /members/join.html page remains accessible while directory is blocked.
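The "most specific match" idea can be sketched as a longest-prefix comparison. This is a simplified model (plain prefixes only, no wildcards) under the assumption that the longest matching rule wins and Allow wins ties; is_allowed is a name made up for this example.

```python
def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Decide access using longest-match precedence.

    rules holds (directive, path_prefix) pairs such as ("disallow", "/members/").
    The longest matching prefix wins; on a tie, allow wins. No match means allowed.
    """
    best_length = -1
    allowed = True
    for directive, prefix in rules:
        if not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        if len(prefix) > best_length or (len(prefix) == best_length and is_allow):
            best_length = len(prefix)
            allowed = is_allow
    return allowed

rules = [("disallow", "/members/"), ("allow", "/members/join.html")]
print(is_allowed(rules, "/members/join.html"))  # True
print(is_allowed(rules, "/members/profile"))    # False
print(is_allowed(rules, "/about"))              # True
```

Because the decision depends on prefix length, reversing the order of the two rules gives the same answers in this model.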
Specific Crawlers
What is the Googlebot user-agent?
Googlebot is Google's main web crawler. Variations include Googlebot-Image, Googlebot-News, Googlebot-Video. Target all with "User-agent: Googlebot" or specify variants. Variant rules override general Googlebot rules.
What is Bingbot?
Microsoft's crawler for Bing search. Also partially powers Yahoo and DuckDuckGo results. Target with "User-agent: Bingbot". Respects Crawl-delay, unlike Googlebot. Important for AI visibility since some AI tools use Bing's index.
How do I block AI crawlers?
Add separate User-agent blocks: GPTBot (OpenAI training), ChatGPT-User (browsing), OAI-SearchBot (SearchGPT), ClaudeBot (Anthropic), Google-Extended (Gemini training), PerplexityBot. Each needs its own Disallow: / rule.
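If you maintain the list in code, a short script can generate the groups. This is a sketch using the crawler names listed above; block_all is a hypothetical helper, and the output still needs to be placed in your real robots.txt.

```python
AI_CRAWLERS = [
    "GPTBot",
    "ChatGPT-User",
    "OAI-SearchBot",
    "ClaudeBot",
    "Google-Extended",
    "PerplexityBot",
]

def block_all(user_agents: list[str]) -> str:
    """Build one 'Disallow: /' group per user-agent token."""
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in user_agents]
    # Blank lines between groups keep the file readable.
    return "\n\n".join(groups) + "\n"

robots_block = block_all(AI_CRAWLERS)
print(robots_block)
```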
What is GPTBot?
OpenAI's crawler for AI training data. Blocking GPTBot prevents use in future model training. Does NOT affect real-time ChatGPT browsing (that's ChatGPT-User). Block with: User-agent: GPTBot then Disallow: /.
What is Google-Extended?
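A robots.txt user-agent token, not a separate crawler, that controls whether Google may use your content for Gemini (formerly Bard) AI training. The crawling itself is done by Google's existing crawlers. Blocking prevents AI training use but does NOT affect Search rankings or AI Overviews. Those use regular Googlebot. Safe to block for AI training opt-out only.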
Can I set different rules for different crawlers?
Yes. Create separate User-agent blocks. Allow Googlebot everywhere but block GPTBot entirely. Each crawler follows its own User-agent section rules, or falls back to User-agent: * if no specific match exists.
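The fallback behavior can be checked with Python's urllib.robotparser. The file below is an invented example, and "OtherBot" is a hypothetical crawler with no group of its own.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/private/report.html"
googlebot_ok = parser.can_fetch("Googlebot", url)  # own group allows everything
gptbot_ok = parser.can_fetch("GPTBot", url)        # own group blocks everything
other_ok = parser.can_fetch("OtherBot", url)       # falls back to the * group
print(googlebot_ok, gptbot_ok, other_ok)  # True False False
```

Note that Googlebot is allowed into /private/ here: once a crawler matches its own group, the * group no longer applies to it.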
Common AI Crawler User-Agents
| Crawler | Company | Purpose |
|---|---|---|
| GPTBot | OpenAI | Model training |
| ChatGPT-User | OpenAI | Real-time browsing |
| OAI-SearchBot | OpenAI | SearchGPT index |
| ClaudeBot | Anthropic | Model training |
| Google-Extended | Google | Gemini training |
| PerplexityBot | Perplexity | Search index |
Testing and Validation
How do I test my robots.txt?
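Use the robots.txt report in Google Search Console (under Settings), which replaced the retired robots.txt Tester; it shows the file Google fetched and any parsing problems. Check individual URLs with the URL Inspection tool. Also test by visiting /robots.txt directly, or use Screaming Frog or online robots.txt checkers.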
How do I check if a URL is blocked?
In Search Console, use the URL Inspection tool; it shows "Blocked by robots.txt" if applicable. To find the specific rule responsible, use a third-party robots.txt tester, since Google retired its own Tester in 2023. Screaming Frog flags blocked URLs during crawls.
Why isn't my robots.txt working?
Common causes: wrong file location, syntax errors, file returns error status, caching delays (crawlers cache 24+ hours), wrong user-agent name, case mismatch in paths, or expecting it to block indexing.
How often do crawlers refresh robots.txt?
Googlebot generally caches robots.txt for up to 24 hours, sometimes longer. Changes don't take effect immediately. To speed things up, request a recrawl from the robots.txt report in Search Console. For urgent changes, implement server-side blocks while waiting for the cache to expire.
What happens with syntax errors?
Crawlers may ignore malformed rules or the entire file. Common errors: spaces before directives, missing colons, typos in directive names. Always validate syntax using Search Console or validators before deploying.
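A very small lint pass can catch two of these errors (missing colons and misspelled directive names) before deployment. This is a sketch, not a full validator; the list of known directives is an assumption based on the directives covered in this FAQ.

```python
KNOWN_DIRECTIVES = {
    "user-agent", "disallow", "allow", "sitemap",
    "crawl-delay", "host", "clean-param",
}

def lint_robots(text: str) -> list[str]:
    """Return human-readable warnings for suspicious robots.txt lines."""
    warnings = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            warnings.append(f"line {number}: missing colon")
            continue
        name = line.split(":", 1)[0].strip().lower()
        if name not in KNOWN_DIRECTIVES:
            warnings.append(f"line {number}: unknown directive '{name}'")
    return warnings

sample = """\
User-agent: *
Disalow: /admin/
Disallow /tmp/
# a comment is fine
Sitemap: https://example.com/sitemap.xml
"""
for warning in lint_robots(sample):
    print(warning)
```

For the sample above it reports the typo on line 2 and the missing colon on line 3.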
Common Mistakes
Should I block CSS and JavaScript?
No. Google needs CSS/JS to render pages correctly. Blocking prevents proper rendering and hurts rankings. This outdated advice no longer applies. Allow Googlebot access to all resources needed for rendering.
Should I block images?
Generally no. Blocked images won't appear in Google Images and may affect page understanding. Only block for specific reasons like bandwidth or truly private images needing protection from all crawlers.
Can I put noindex in robots.txt?
No. Google deprecated the "Noindex:" directive in robots.txt in 2019. It's completely ignored now. Use noindex meta tag in HTML or X-Robots-Tag HTTP header instead for index control.
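For files that cannot carry a meta tag, such as PDFs, the header has to be set by the server. The sketch below shows the idea as a tiny WSGI app; the app, the paths, and the fake_start_response helper are all invented for illustration, and in practice this is usually a web server or CDN setting instead.

```python
def app(environ, start_response):
    """Tiny WSGI app that marks PDF responses as noindex via an HTTP header."""
    path = environ.get("PATH_INFO", "/")
    headers = [("Content-Type", "text/plain; charset=utf-8")]
    if path.lower().endswith(".pdf"):
        # The header covers non-HTML files that cannot carry a meta tag.
        headers.append(("X-Robots-Tag", "noindex"))
    start_response("200 OK", headers)
    return [b"placeholder body\n"]

# Call the app directly with a stand-in for the server's start_response.
captured = {}

def fake_start_response(status, headers):
    captured["status"] = status
    captured["headers"] = dict(headers)

app({"PATH_INFO": "/files/report.pdf"}, fake_start_response)
print(captured["headers"].get("X-Robots-Tag"))  # noindex
```

The header only has an effect if crawlers are allowed to fetch the URL, so the path must not also be disallowed in robots.txt.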
Can I block crawling AND use noindex?
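Not together effectively. If crawling is blocked, bots never fetch the URL, so they see neither a noindex meta tag nor an X-Robots-Tag HTTP header. To get a page out of the index, allow crawling and serve noindex (meta tag for HTML, X-Robots-Tag for PDFs and other non-HTML files). Add a Disallow rule only after the page has dropped out of the index, if at all.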
Does the trailing slash matter?
Yes. "Disallow: /folder" blocks /folder, /folder/, and /folder-name.html. "Disallow: /folder/" only blocks /folder/ directory contents. Use trailing slash for directories to be precise about scope.
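Because a plain Disallow value is a prefix, the difference is easy to reproduce with a string comparison. The paths and the blocked_by helper below are made up for this sketch.

```python
PATHS = ["/folder", "/folder/", "/folder/page.html", "/folder-name.html"]

def blocked_by(rule: str, paths: list[str]) -> list[str]:
    """Return the paths a plain (wildcard-free) Disallow prefix would block."""
    return [path for path in paths if path.startswith(rule)]

without_slash = blocked_by("/folder", PATHS)
with_slash = blocked_by("/folder/", PATHS)
print(without_slash)  # all four paths
print(with_slash)     # ['/folder/', '/folder/page.html']
```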
What if I accidentally block everything?
"Disallow: /" under "User-agent: *" blocks all crawlers from all pages. Your entire site disappears from search over time. Always test robots.txt after editing. Verify critical pages with URL Inspection.
Robots.txt Quick Reference
| Rule | Effect |
|---|---|
| Disallow: / | Block entire site |
| Disallow: | Allow entire site (empty value) |
| Disallow: /folder/ | Block directory and contents |
| Disallow: /*.pdf$ | Block all PDF files |
| Allow: /folder/page.html | Exception within blocked directory |
Advanced Usage
Can I list multiple sitemaps?
Yes. Add multiple Sitemap directives, each on its own line. Example: Sitemap: https://example.com/sitemap-posts.xml then Sitemap: https://example.com/sitemap-pages.xml. All will be discovered by crawlers.
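To confirm that every Sitemap line is picked up, you can parse the file with Python's urllib.robotparser (site_maps() needs Python 3.8 or later). The file content here is an invented example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap-posts.xml
Sitemap: https://example.com/sitemap-pages.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

sitemaps = parser.site_maps()  # list of URLs, or None if the file declares none
print(sitemaps)
```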
What is the Host directive?
A Yandex-specific directive indicating preferred domain version. Example: Host: www.example.com. Google ignores it; use canonical tags and redirects instead to signal your preferred domain, since Search Console no longer has a preferred-domain setting.
What is the Clean-param directive?
Yandex-specific directive for handling URL parameters. Tells Yandex which parameters don't affect page content. Google ignores it; use canonical tags instead, since Search Console's URL Parameters tool was retired in 2022.
Do subdomains share robots.txt?
No. Each subdomain needs its own robots.txt file. blog.example.com/robots.txt is completely separate from example.com/robots.txt. Rules don't cascade or inherit across subdomains.
Do HTTP and HTTPS need separate files?
Technically yes, they're different protocols. However, you should redirect HTTP to HTTPS anyway. Maintain robots.txt on your canonical HTTPS version. The redirect handles HTTP requests.
Does my CDN need robots.txt?
If your CDN uses a different domain (cdn.example.com), it may need its own robots.txt. For same-domain CDN setups, your main robots.txt applies. Check if CDN assets are being crawled unexpectedly.
Is there a size limit?
Google limits processing to 500 KiB. Content beyond that limit is ignored. Most sites never approach this limit. If yours does, consolidate rules using wildcards or reconsider what truly needs blocking.
Can I use non-ASCII characters?
Robots.txt should use UTF-8 encoding for international characters in paths. Most crawlers handle UTF-8 correctly. Avoid special characters in directive names; those must be ASCII only.
Does rule order matter?
For Googlebot, specificity wins over order. More specific paths take precedence. For other crawlers, order may matter; place Allow before Disallow for safety. Test your specific use case.
How do I edit robots.txt in WordPress?
WordPress generates virtual robots.txt by default. Edit via Yoast SEO or Rank Math settings, or create a physical file in your root directory. Physical file overrides virtual generation.
How do I edit robots.txt in Shopify?
Shopify allows robots.txt customization through the robots.txt.liquid template, not theme.liquid. Access via Online Store > Themes > Edit Code, then add the robots.txt.liquid template. Shopify has default rules you can modify.
How do I verify bots are respecting robots.txt?
Check server log files for bot activity. Look for crawl attempts on blocked paths; compliant bots won't request them. Non-compliant or malicious bots may ignore rules and crawl anyway.
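A log check along these lines can be scripted. The sketch below assumes the common "combined" access log format; the log lines, the bot name ExampleBot, and the violations helper are fabricated for illustration, and user-agent strings can be spoofed.

```python
import re

# Combined format: ip - - [time] "METHOD path HTTP/x" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def violations(log_lines, bot_token, disallowed_prefixes):
    """Yield (path, user_agent) for requests by bot_token to disallowed paths."""
    prefixes = tuple(disallowed_prefixes)
    for line in log_lines:
        match = LOG_LINE.search(line)
        if not match:
            continue
        path, agent = match["path"], match["agent"]
        if bot_token.lower() in agent.lower() and path.startswith(prefixes):
            yield path, agent

sample_log = [
    '203.0.113.5 - - [07/Dec/2025:10:00:00 +0000] "GET /admin/login HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
    '203.0.113.5 - - [07/Dec/2025:10:00:05 +0000] "GET /blog/ HTTP/1.1" 200 2048 "-" "ExampleBot/1.0"',
    '198.51.100.9 - - [07/Dec/2025:10:01:00 +0000] "GET /admin/ HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]
found = list(violations(sample_log, "ExampleBot", ["/admin/"]))
print(found)  # [('/admin/login', 'ExampleBot/1.0')]
```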
