Your robots.txt file is the first thing nearly every AI crawler reads. It is also the last line of defense most publishers never configure correctly. This guide gives you the complete picture: which bots respect it, which ignore it, and exactly what to put in yours.
Most publishers either block all AI bots out of fear or leave their robots.txt untouched and hope for the best. Both approaches cost visibility. The research is clear: robots.txt is not just an access control file. It is a discovery mechanism that AI crawlers actively parse to find content they would never encounter through normal crawling (Lee, 2026).
We tested every major AI crawler against robots.txt directives using server-side Vercel middleware logs on aiplusautomation.com. Every request was logged with its user agent, the path it hit, and the timestamp. This guide distills those findings alongside published research into a single, actionable reference.
🤖 WHAT ROBOTS.TXT ACTUALLY DOES (AND DOES NOT DO)
robots.txt is a plain text file at the root of your domain (/robots.txt) that tells web crawlers which paths they can and cannot access. It operated as a de facto web convention from 1994 until it was formalized as RFC 9309 in 2022.
What it does:
- Declares which user agents can crawl which paths
- Provides Sitemap references for content discovery
- Sets crawl-delay preferences (respected by some bots)
What it does not do:
- Block content from being read (it is advisory, not enforced)
- Prevent content from appearing in AI training data retroactively
- Stop live conversational fetches from ChatGPT-User
- Guarantee any crawler will comply
The Bottom Line: robots.txt is a polite request, not a firewall. Compliant bots honor it. Non-compliant bots ignore it. And some bots, like ChatGPT-User, have been explicitly redesignated by their operators as "user extensions" rather than crawlers, exempting them from robots.txt conventions entirely.
Longpre et al. (2024) documented the scale: in a single year (2023 to 2024), over 5% of all tokens in the C4 training corpus became restricted through robots.txt, with 28% of the most actively maintained sources adding AI-specific blocks (Longpre et al., 2024).
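Because robots.txt is advisory, a compliant bot's decision process is well-defined and easy to model. Here is a minimal sketch using Python's standard-library parser (the rules and paths are illustrative, not our live file); the point is that nothing server-side enforces the answer:

```python
# Sketch: how a *compliant* crawler evaluates robots.txt before a fetch.
# A non-compliant bot simply skips this step entirely.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: GPTBot
Disallow: /private/

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A compliant GPTBot may fetch public paths but not /private/.
print(parser.can_fetch("GPTBot", "https://example.com/blog/post"))       # True
print(parser.can_fetch("GPTBot", "https://example.com/private/report"))  # False
```

The same `can_fetch` call against any other user agent falls through to the `User-agent: *` group, which allows everything.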
📊 WHICH AI BOTS RESPECT ROBOTS.TXT (COMPLETE TABLE)
This is the core reference. We verified each crawler's behavior through controlled server log analysis over several months.
| Bot (User-Agent Token) | Operator | Purpose | Respects robots.txt | Notes |
|---|---|---|---|---|
| GPTBot | OpenAI | Training data collection | Yes | Also follows Sitemap directives to discover content |
| OAI-SearchBot | OpenAI | ChatGPT Search index | Yes | Separate token from GPTBot |
| ChatGPT-User | OpenAI | Live conversational fetch | No | Reclassified as "user extension" Dec 2025 |
| ClaudeBot | Anthropic | Training data collection | Yes | Standard crawler compliance |
| Claude-User | Anthropic | Live conversational fetch | Yes (session-cached) | Checks robots.txt once per session, caches result |
| PerplexityBot | Perplexity | Background index crawler | Yes | No live fetcher; all answers from pre-built index |
| Google-Extended | Google | AI training data (Gemini) | Yes | Separate from Googlebot; controls AI training only |
| Googlebot | Google | Google Search (including AI Mode) | Yes | Standard Googlebot rules apply |
| Bytespider | ByteDance | Training data collection | Yes | Often blocked due to aggressive crawl rates |
| Applebot-Extended | Apple | Apple Intelligence training | Yes | Separate from Applebot (Siri/Spotlight) |
| meta-externalagent | Meta | AI training data | Yes | Relatively new token (2024) |
| Amazonbot | Amazon | Alexa/AI training | Yes | Respects crawl-delay |
| cohere-ai | Cohere | Training data collection | Yes | Lower volume crawler |
| Diffbot | Diffbot | Web data extraction | Yes | Used by many downstream AI services |
| Timesbot | Brave | Brave Search AI features | Yes | Lower volume |
| FacebookBot | Meta | Content preview/AI | Yes | Distinct from meta-externalagent |
| anthropic-ai | Anthropic | Research crawler | Yes | Less common than ClaudeBot |
Cui et al. (2025) conducted a systematic analysis of 18 LLM bots across 582,281 robots.txt files and confirmed a significant increase in AI-specific robots.txt rules, particularly in finance and news categories. Their research also identified multiple cases of robots.txt violations, including instances where ChatGPT-User accessed restricted content (Cui et al., 2025).
The Bottom Line: Of 15+ tracked AI bots, only one -- ChatGPT-User -- ignores robots.txt outright. Every other major crawler complies. Claude is the most thorough: Claude-User checks robots.txt before its first fetch in each session and returns a "no permission" error if a path is disallowed. For the full breakdown of OpenAI's three-bot split, see our deep dive: OpenAI's Bots Have a Split Personality.
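You can run this kind of verification against your own access logs with a few lines of Python. A hedged sketch: it assumes combined-log-format lines with the User-Agent as the last quoted field (your server's format may differ), and the token list mirrors the table above:

```python
# Hypothetical sketch: tally AI-crawler hits from an access log.
# Assumes combined log format, where the User-Agent is the last
# double-quoted field on each line.
import re
from collections import Counter

AI_BOT_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "Claude-User",
    "PerplexityBot", "Google-Extended", "Bytespider", "Applebot-Extended",
    "meta-externalagent", "Amazonbot", "cohere-ai", "anthropic-ai", "Diffbot",
]

def classify(user_agent: str):
    """Return the first known AI-bot token found in a User-Agent string."""
    for token in AI_BOT_TOKENS:
        if token.lower() in user_agent.lower():
            return token
    return None

def tally(log_lines):
    """Count hits per AI bot across an iterable of log lines."""
    hits = Counter()
    for line in log_lines:
        quoted = re.findall(r'"([^"]*)"', line)  # last quoted field = UA
        if quoted:
            bot = classify(quoted[-1])
            if bot:
                hits[bot] += 1
    return hits
```

Feed `tally` your log file line by line and the resulting counter shows which crawlers are actually visiting, and how often.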
🔍 ROBOTS.TXT AS A DISCOVERY MECHANISM
Here is what most guides miss: robots.txt is not just about blocking. It is one of the most reliable ways to help AI crawlers find your content.
How GPTBot Uses Sitemap Directives
Our robots.txt includes this line:
Sitemap: https://aiplusautomation.com/site-knowledge.jsonld
That is a JSON-LD knowledge graph file, not a standard sitemap.xml, and no page on the site links to it. The only way to find it is to read robots.txt, parse the Sitemap directive, and follow the URL.
GPTBot did exactly that. Our server logs show GPTBot fetching site-knowledge.jsonld, a file discoverable only through robots.txt. This confirms that GPTBot treats robots.txt as a discovery tool, not just a permission file.
The Bottom Line: If you have structured data, knowledge graphs, or specialized content you want AI training crawlers to index, reference them in your robots.txt Sitemap directives. GPTBot actively follows these references. For more on tracking which bots actually crawl your site, see How to See Which AI Bots Crawl Your Website.
Sitemap Best Practices for AI Crawlers
- Include your XML sitemap -- standard, but essential
- Reference structured data files -- JSON-LD knowledge graphs, site-knowledge files
- Keep sitemaps fresh -- AI crawlers check for new content here first
- Use multiple Sitemap directives if you have separate data for different purposes
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/site-knowledge.jsonld
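The discovery step GPTBot performed can be mirrored with Python's standard-library parser, which exposes Sitemap directives directly via `site_maps()` (available since Python 3.8). A small sketch with placeholder URLs:

```python
# Sketch: extracting Sitemap directives the way a crawler would.
# URLs are placeholders matching the example above.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/site-knowledge.jsonld
"""

rfp = RobotFileParser()
rfp.parse(robots_txt.splitlines())

# site_maps() returns every Sitemap URL, including non-XML references
# like a JSON-LD knowledge graph file.
print(rfp.site_maps())
# ['https://example.com/sitemap.xml', 'https://example.com/site-knowledge.jsonld']
```

Note that `site_maps()` returns the URLs regardless of which User-agent group they appear near; per the spec, Sitemap lines are global to the file.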
🎭 THE "BLOCKING AI IS THEATER" REALITY
Let's address the elephant in the room. Many publishers have added blanket AI bot blocks to their robots.txt:
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: PerplexityBot
Disallow: /
This feels decisive. It is mostly theater. Here is why:
1. ChatGPT-User ignores it. The bot most likely to present your content directly to users (during live ChatGPT conversations) does not check robots.txt at all. OpenAI formalized this in December 2025 by reclassifying ChatGPT-User as a "user extension" rather than a crawler.
2. Training data is retrospective. If your content was crawled before you added the block, it may already exist in training data. robots.txt is not a deletion request. It only governs future crawling.
3. Third-party caching and redistribution. Your content likely appears on aggregators, web archives, and content platforms that have their own crawl policies. Blocking the primary AI crawlers does not erase copies that exist elsewhere.
4. You lose the discovery channel. Blocking GPTBot means it never follows your Sitemap directives, never discovers your structured data, and never indexes your new content. You are invisible to the training pipeline that determines what ChatGPT knows about your brand.
5. You lose citation eligibility. Perplexity serves answers exclusively from its pre-built index (crawled by PerplexityBot). Block PerplexityBot, and you will never appear in Perplexity citations. The same logic applies to OAI-SearchBot and ChatGPT Search results.
Longpre et al. (2024) found that if current robots.txt restrictions were fully enforced, they would rapidly bias the diversity, freshness, and scaling laws for general-purpose AI systems. The effect would not be "your content is protected." The effect would be "your content does not exist in AI."
For most businesses, the better strategy is allowing AI crawlers and optimizing for visibility. Our AI Visibility service helps with exactly this.
📝 RECOMMENDED ROBOTS.TXT CONFIGURATION
Based on our research and testing, here is our recommended approach for most businesses that want to maximize AI visibility.
Template 1: Maximum AI Visibility (Recommended)
# === Standard Crawlers ===
User-agent: *
Allow: /
# === AI Training Crawlers ===
User-agent: GPTBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Bytespider
Allow: /
User-agent: Applebot-Extended
Allow: /
User-agent: meta-externalagent
Allow: /
User-agent: Amazonbot
Allow: /
User-agent: cohere-ai
Allow: /
User-agent: anthropic-ai
Allow: /
User-agent: Diffbot
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: Timesbot
Allow: /
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/site-knowledge.jsonld
Why explicit Allow directives? While User-agent: * technically permits all bots, explicitly naming each AI crawler signals intent and makes your policy unambiguous. It also prepares per-bot sections for path-specific restrictions later.
Template 2: Allow AI Search, Block AI Training
For publishers who want to appear in AI search results but prefer not to contribute to training data:
# === Standard Crawlers ===
User-agent: *
Allow: /
# === AI Search Crawlers (Allowed) ===
User-agent: OAI-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
# === AI Training Crawlers (Blocked) ===
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: meta-externalagent
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: anthropic-ai
Disallow: /
Sitemap: https://example.com/sitemap.xml
Trade-off: You retain search citation eligibility but lose GPTBot's Sitemap discovery benefits. A defensible middle ground for publishers with licensing concerns.
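Before deploying a split policy like this, it is worth sanity-checking that the rules yield the decisions you intend. A condensed sketch (two bots from Template 2, using Python's stdlib parser as an approximation of compliant-bot behavior, not a guarantee of what any specific crawler does):

```python
# Sketch: verify a search-allowed / training-blocked split behaves
# as intended. Condensed to two user agents for brevity.
from urllib.robotparser import RobotFileParser

template2 = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(template2.splitlines())

print(rp.can_fetch("OAI-SearchBot", "https://example.com/blog/post"))  # True
print(rp.can_fetch("GPTBot", "https://example.com/blog/post"))         # False
```

Running checks like this before pushing a new robots.txt catches the classic mistake of a Disallow group silently swallowing a bot you meant to allow.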
Template 3: Selective Path Blocking
For sites that want AI visibility for most content but need to protect specific sections (e.g., premium content, gated resources):
# === Standard Crawlers ===
User-agent: *
Allow: /
# === AI Crawlers: Allow site, block premium ===
User-agent: GPTBot
Allow: /
Disallow: /premium/
Disallow: /members-only/
User-agent: ClaudeBot
Allow: /
Disallow: /premium/
Disallow: /members-only/
User-agent: PerplexityBot
Allow: /
Disallow: /premium/
Disallow: /members-only/
User-agent: OAI-SearchBot
Allow: /
Disallow: /premium/
Disallow: /members-only/
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/site-knowledge.jsonld
Note: ChatGPT-User will still fetch premium paths during live conversations regardless. Server-level IP blocking is the only reliable method for that bot.
📋 COMPLETE USER-AGENT TOKEN REFERENCE
Every AI-related user-agent token you should know about, organized by operator:
OpenAI
| Token | Purpose | First Seen |
|---|---|---|
| GPTBot | Training data crawler | 2023 |
| OAI-SearchBot | ChatGPT Search index | 2024 |
| ChatGPT-User | Live conversation fetch | 2023 |
Anthropic
| Token | Purpose | First Seen |
|---|---|---|
| ClaudeBot | Training data crawler | 2024 |
| Claude-User | Live conversation fetch | 2024 |
| anthropic-ai | Research crawler | 2023 |
Google
| Token | Purpose | First Seen |
|---|---|---|
| Google-Extended | AI training (Gemini) | 2023 |
| Googlebot | Google Search (incl. AI Mode) | Legacy |
Other Platforms
| Token | Operator | Purpose | First Seen |
|---|---|---|---|
| PerplexityBot | Perplexity | Index crawler | 2024 |
| Bytespider | ByteDance | Training crawler | 2023 |
| Applebot-Extended | Apple | AI training | 2024 |
| meta-externalagent | Meta | AI training | 2024 |
| FacebookBot | Meta | Content preview/AI | Legacy |
| Amazonbot | Amazon | Alexa/AI training | 2023 |
| cohere-ai | Cohere | Training crawler | 2024 |
| Diffbot | Diffbot | Data extraction | Legacy |
| Timesbot | Brave | Search AI features | 2024 |
For a live view of which bots are hitting your site right now, run a free AI visibility check.
⚡ CHATGPT-USER: THE BOT YOU CANNOT BLOCK WITH ROBOTS.TXT
This deserves its own section because it is the most consequential change in AI crawling behavior.
In December 2025, OpenAI updated its crawler documentation to formally remove any language suggesting ChatGPT-User would comply with robots.txt. Their stated rationale: ChatGPT-User is "a technical extension of the user" rather than an autonomous crawler. When someone asks ChatGPT to browse the web and it fetches your page, OpenAI considers that equivalent to a human clicking a link.
Our testing confirms this. ChatGPT-User fetched pages that were explicitly blocked by Disallow directives. No robots.txt request preceded the page fetch in server logs. It does not check.
How Claude handles the same scenario differently: Claude-User checks robots.txt once at the start of each session. If you have a Disallow for Claude-User, it returns a "no permission" error and logs zero server hits for the blocked path. The difference in approach is stark.
| Behavior | ChatGPT-User | Claude-User |
|---|---|---|
| Checks robots.txt | No | Yes (once per session) |
| Honors Disallow directives | No | Yes |
| Caching behavior | N/A | Session-level cache |
| Blocking method | Server-level IP block only | robots.txt works |
The Bottom Line: If you need to control ChatGPT-User access, robots.txt will not help. Your only option is blocking OpenAI's IP ranges at the server level, which also blocks GPTBot and OAI-SearchBot. This is almost always counterproductive if you want AI visibility.
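For completeness, here is what server-level blocking looks like in outline. This is a generic WSGI sketch, not our production setup (we use Vercel middleware), and it matches on the User-Agent string; since User-Agent values can be spoofed, the robust variant verifies source IPs against OpenAI's published ranges, which are not reproduced here:

```python
# Hedged sketch: server-level blocking by User-Agent token, the only
# control that reaches ChatGPT-User since it skips robots.txt.
# User-Agent matching is spoofable; IP-range verification is stronger.
BLOCKED_UA_TOKENS = ("ChatGPT-User",)  # extend as needed

def block_ai_fetchers(app):
    """Wrap a WSGI app, returning 403 for blocked User-Agent tokens."""
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if any(tok.lower() in ua.lower() for tok in BLOCKED_UA_TOKENS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Access denied"]
        return app(environ, start_response)
    return middleware
```

Wrap your application once (`app = block_ai_fetchers(app)`) and every request passes through the check; but note again that this blunt instrument, extended to OpenAI's IP ranges, also cuts off GPTBot and OAI-SearchBot.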
For the full technical breakdown, read OpenAI's Bots Have a Split Personality.
🛡️ PRACTICAL IMPLEMENTATION CHECKLIST
Use this checklist to audit and update your robots.txt for AI visibility:
- Verify your robots.txt is accessible at yourdomain.com/robots.txt
- Check for accidental blanket blocks (e.g., Disallow: / under User-agent: *)
- Add explicit AI crawler rules using the templates above
- Include Sitemap directives pointing to your XML sitemap and any structured data files
- Remove outdated bot blocks (e.g., blocks for bots that no longer exist or have changed tokens)
- Test with a fetch tool to confirm your robots.txt parses correctly
- Monitor AI crawler activity to verify bots are actually hitting your site after changes
- Set a review schedule because new AI bots appear regularly and token names change
For ongoing monitoring, see our guide on how to track AI bots effectively.
🔗 HOW ROBOTS.TXT FITS INTO YOUR AI VISIBILITY STRATEGY
robots.txt is one piece of a larger optimization strategy. Here is how it connects to the rest:
1. robots.txt + Structured Data = AI Discovery. AI crawlers that follow your Sitemap directives will find your structured data (JSON-LD, schema markup). Research shows that schema markup presence is associated with 69% higher AI citation odds (Lee, 2026). robots.txt is the front door to that data.
2. robots.txt + Content Quality = Citation Eligibility. Allowing crawlers is necessary but not sufficient. The content itself needs to match query intent and include the page-level features that predict citation. Aggarwal et al. (2024) demonstrated that targeted optimization strategies can boost visibility in generative engine responses by up to 40% (Aggarwal et al., 2024). For a complete optimization framework, see our Generative Engine Optimization guide.
3. robots.txt + Monitoring = Informed Decisions. Configuring robots.txt without monitoring is flying blind. You need to verify that bots are actually crawling, how often they return, and which paths they prioritize. Our AI Visibility service provides this monitoring layer.
❓ FREQUENTLY ASKED QUESTIONS
Should I block AI bots in robots.txt? For most businesses, no. Blocking AI crawlers removes you from AI search citations (Perplexity, ChatGPT Search) and prevents discovery of your structured data. The only scenario where blocking makes sense is if you have strong licensing or copyright concerns about AI training specifically, and even then, ChatGPT-User will still fetch your content during live conversations regardless. A selective approach (Template 2 above) is a more practical middle ground.
How quickly do AI bots pick up robots.txt changes? Most AI crawlers re-fetch robots.txt every 24 to 72 hours. GPTBot and ClaudeBot typically check daily. PerplexityBot can take up to a week. Claude-User caches robots.txt at the session level, so existing sessions may not reflect changes until the next session starts. There is no way to force an immediate refresh.
Does blocking Google-Extended affect my Google Search rankings? No. Google-Extended controls only whether your content is used for Gemini AI training. It has no effect on Googlebot or your Google Search rankings (including AI Overviews and AI Mode). Google-Extended is a separate token specifically for AI training opt-out.
Can I allow AI search bots but block AI training bots? Yes. Template 2 above does exactly this. Allow OAI-SearchBot (ChatGPT Search) and PerplexityBot (Perplexity answers) while blocking GPTBot (OpenAI training), ClaudeBot (Anthropic training), and Google-Extended (Gemini training). The trade-off is losing GPTBot's Sitemap discovery behavior.
What happens if my robots.txt has conflicting rules for an AI bot? Per RFC 9309, the most specific rule wins. If you have Allow: /blog/ and Disallow: / for the same user agent, the bot should access /blog/ paths but not other paths. However, implementation varies by bot. Test with server logs rather than assuming compliance. For tips on setting up effective monitoring, see how to track AI bots effectively.
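The precedence behavior can be checked mechanically. One caveat for this sketch: Python's stdlib parser resolves conflicts by first-match order rather than RFC 9309's longest-match rule, so the more specific Allow line is listed first to get the RFC-expected outcome:

```python
# Sketch: conflicting Allow/Disallow rules for one user agent.
# Stdlib caveat: first matching rule wins, so Allow: /blog/ is
# listed before Disallow: / to mirror RFC 9309 longest-match.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: GPTBot
Allow: /blog/
Disallow: /
"""

rp2 = RobotFileParser()
rp2.parse(rules.splitlines())

print(rp2.can_fetch("GPTBot", "https://example.com/blog/post"))  # True
print(rp2.can_fetch("GPTBot", "https://example.com/pricing"))    # False
```

Swapping the two rule lines would flip the /blog/ result under the stdlib's first-match logic, whereas an RFC 9309 longest-match crawler should treat both orderings identically. That gap is exactly why testing against real server logs beats assuming compliance.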
📚 REFERENCES
- Lee, A. (2026). "Query Intent, Not Google Rank: What Best Predicts AI Citation Behavior." Preprint v5. DOI
- Aggarwal, P., Murahari, V., Rajpurohit, T., Kalyan, A., Narasimhan, K., & Deshpande, A. (2024). "GEO: Generative Engine Optimization." KDD 2024. DOI
- Longpre, S., Mahari, R., Lee, A., Lund, C.P., & Oderinwale, H. (2024). "Consent in Crisis: The Rapid Decline of the AI Data Commons." Preprint. DOI
- Cui, J., Zha, M., Wang, X., & Liao, X. (2025). "The Odyssey of robots.txt Governance: Measuring Convention Implications of Web Bots in Large Language Model Services." ACM WWW 2025. DOI
- RFC 9309 (2022). "Robots Exclusion Protocol." IETF.