
robots.txt for AI Bots: The Complete Reference Guide (2026)

2026-03-24

Your robots.txt file is the first thing every AI crawler reads. It is also the last line of defense most publishers never configure correctly. This guide gives you the complete picture: which bots respect it, which ignore it, and exactly what to put in yours.

Most publishers either block all AI bots out of fear or leave their robots.txt untouched and hope for the best. Both approaches cost visibility. The research is clear: robots.txt is not just an access control file. It is a discovery mechanism that AI crawlers actively parse to find content they would never encounter through normal crawling (Lee, 2026).

We tested every major AI crawler against robots.txt directives using server-side Vercel middleware logs on aiplusautomation.com. Every request was logged with its user agent, the path it hit, and the timestamp. This guide distills those findings alongside published research into a single, actionable reference.

🤖 WHAT ROBOTS.TXT ACTUALLY DOES (AND DOES NOT DO)

robots.txt is a plain text file at the root of your domain (/robots.txt) that tells web crawlers which paths they can and cannot access. It has been a web standard since 1994 and was formalized as RFC 9309 in 2022.

What it does:

  • Declares which user agents can crawl which paths
  • Provides Sitemap references for content discovery
  • Sets crawl-delay preferences (respected by some bots)

What it does not do:

  • Block content from being read (it is advisory, not enforced)
  • Prevent content from appearing in AI training data retroactively
  • Stop live conversational fetches from ChatGPT-User
  • Guarantee any crawler will comply
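The "advisory, not enforced" point can be seen in code: compliance lives entirely on the client side. A minimal sketch using Python's standard-library urllib.robotparser shows how a compliant crawler consults the rules before fetching; a non-compliant bot simply skips this step, and nothing on the server stops it.

```python
# Minimal sketch: how a *compliant* crawler consults robots.txt before fetching.
# Nothing server-side enforces the answer -- non-compliant bots skip the check.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: GPTBot
Disallow: /premium/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A compliant bot asks before each fetch; the verdict is purely advisory.
print(parser.can_fetch("GPTBot", "https://example.com/premium/report"))  # False
print(parser.can_fetch("GPTBot", "https://example.com/blog/post"))       # True
```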

The Bottom Line: robots.txt is a polite request, not a firewall. Compliant bots honor it. Non-compliant bots ignore it. And at least one bot, ChatGPT-User, has been explicitly redesignated by its operator as a "user extension" rather than a crawler, exempting it from robots.txt conventions entirely.

Longpre et al. (2024) documented the scale: in a single year (2023 to 2024), over 5% of all tokens in the C4 training corpus became restricted through robots.txt, and 28% of the most actively maintained sources added AI-specific blocks.

📊 WHICH AI BOTS RESPECT ROBOTS.TXT (COMPLETE TABLE)

This is the core reference. We verified each crawler's behavior through controlled server log analysis over several months.

| Bot (User-Agent Token) | Operator | Purpose | Respects robots.txt | Notes |
|---|---|---|---|---|
| GPTBot | OpenAI | Training data collection | Yes | Also follows Sitemap directives to discover content |
| OAI-SearchBot | OpenAI | ChatGPT Search index | Yes | Separate token from GPTBot |
| ChatGPT-User | OpenAI | Live conversational fetch | No | Reclassified as "user extension" Dec 2025 |
| ClaudeBot | Anthropic | Training data collection | Yes | Standard crawler compliance |
| Claude-User | Anthropic | Live conversational fetch | Yes (session-cached) | Checks robots.txt once per session, caches result |
| PerplexityBot | Perplexity | Background index crawler | Yes | No live fetcher; all answers from pre-built index |
| Google-Extended | Google | AI training data (Gemini) | Yes | Separate from Googlebot; controls AI training only |
| Googlebot | Google | Google Search (including AI Mode) | Yes | Standard Googlebot rules apply |
| Bytespider | ByteDance | Training data collection | Yes | Often blocked due to aggressive crawl rates |
| Applebot-Extended | Apple | Apple Intelligence training | Yes | Separate from Applebot (Siri/Spotlight) |
| meta-externalagent | Meta | AI training data | Yes | Relatively new token (2024) |
| Amazonbot | Amazon | Alexa/AI training | Yes | Respects crawl-delay |
| cohere-ai | Cohere | Training data collection | Yes | Lower volume crawler |
| Diffbot | Diffbot | Web data extraction | Yes | Used by many downstream AI services |
| Timesbot | Brave | Brave Search AI features | Yes | Lower volume |
| FacebookBot | Meta | Content preview/AI | Yes | Distinct from meta-externalagent |
| anthropic-ai | Anthropic | Research crawler | Yes | Less common than ClaudeBot |

Cui et al. (2025) conducted a systematic analysis of 18 LLM bots across 582,281 robots.txt files and confirmed a significant increase in AI-specific robots.txt rules, particularly in finance and news categories. Their research also identified multiple cases of robots.txt violations, including instances where ChatGPT-User accessed restricted content.

The Bottom Line: Of 15+ tracked AI bots, only one -- ChatGPT-User -- ignores robots.txt outright. Every other major crawler complies. Claude is the most thorough: Claude-User checks robots.txt before its first fetch in each session and returns a "no permission" error if a path is disallowed. For the full breakdown of OpenAI's three-bot split, see our deep dive: OpenAI's Bots Have a Split Personality.

🔍 ROBOTS.TXT AS A DISCOVERY MECHANISM

Here is what most guides miss: robots.txt is not just about blocking. It is one of the most reliable ways to help AI crawlers find your content.

How GPTBot Uses Sitemap Directives

Our robots.txt includes this line:

Sitemap: https://aiplusautomation.com/site-knowledge.jsonld

That is a JSON-LD knowledge graph file, not a standard sitemap.xml, and no internal links anywhere on the site point to it. The only way to find it is to read robots.txt, parse the Sitemap directive, and follow the URL.

GPTBot did exactly that. Our server logs show GPTBot fetching site-knowledge.jsonld, a file discoverable only through robots.txt. This confirms that GPTBot treats robots.txt as a discovery tool, not just a permission file.

The Bottom Line: If you have structured data, knowledge graphs, or specialized content you want AI training crawlers to index, reference them in your robots.txt Sitemap directives. GPTBot actively follows these references. For more on tracking which bots actually crawl your site, see How to See Which AI Bots Crawl Your Website.

Sitemap Best Practices for AI Crawlers

  1. Include your XML sitemap -- standard, but essential
  2. Reference structured data files -- JSON-LD knowledge graphs, site-knowledge files
  3. Keep sitemaps fresh -- AI crawlers check for new content here first
  4. Use multiple Sitemap directives if you have separate data for different purposes

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/site-knowledge.jsonld
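The discovery step a Sitemap-following crawler performs can be sketched in a few lines with Python's standard-library urllib.robotparser (its site_maps() helper requires Python 3.8+). The URLs below are placeholders; every Sitemap URL, XML or not, is a candidate fetch target.

```python
# Sketch: extracting Sitemap directives from robots.txt the way a
# discovery-oriented crawler would. site_maps() needs Python 3.8+.
from urllib.robotparser import RobotFileParser

robots = """\
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/site-knowledge.jsonld
"""

parser = RobotFileParser()
parser.parse(robots.splitlines())

# Each Sitemap URL becomes a fetch target -- including non-XML files.
for url in parser.site_maps() or []:
    print(url)
```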

🎭 THE "BLOCKING AI IS THEATER" REALITY

Let's address the elephant in the room. Many publishers have added blanket AI bot blocks to their robots.txt:

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

This feels decisive. It is mostly theater. Here is why:

1. ChatGPT-User ignores it. The bot most likely to present your content directly to users (during live ChatGPT conversations) does not check robots.txt at all. OpenAI formalized this in December 2025 by reclassifying ChatGPT-User as a "user extension" rather than a crawler.

2. Training data is retrospective. If your content was crawled before you added the block, it may already exist in training data. robots.txt is not a deletion request. It only governs future crawling.

3. Third-party caching and redistribution. Your content likely appears on aggregators, web archives, and content platforms that have their own crawl policies. Blocking the primary AI crawlers does not erase copies that exist elsewhere.

4. You lose the discovery channel. Blocking GPTBot means it never follows your Sitemap directives, never discovers your structured data, and never indexes your new content. You are invisible to the training pipeline that determines what ChatGPT knows about your brand.

5. You lose citation eligibility. Perplexity serves answers exclusively from its pre-built index (crawled by PerplexityBot). Block PerplexityBot, and you will never appear in Perplexity citations. The same logic applies to OAI-SearchBot and ChatGPT Search results.

Longpre et al. (2024) found that if current robots.txt restrictions were fully enforced, they would rapidly bias the diversity, freshness, and scaling laws for general-purpose AI systems. The effect would not be "your content is protected." The effect would be "your content does not exist in AI."

For most businesses, the better strategy is allowing AI crawlers and optimizing for visibility. Our AI Visibility service helps with exactly this.

📝 RECOMMENDED ROBOTS.TXT CONFIGURATION

Based on our research and testing, here is our recommended approach for most businesses that want to maximize AI visibility.

Template 1: Maximum AI Visibility (Recommended)

# === Standard Crawlers ===
User-agent: *
Allow: /

# === AI Training Crawlers ===
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Bytespider
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: meta-externalagent
Allow: /

User-agent: Amazonbot
Allow: /

User-agent: cohere-ai
Allow: /

User-agent: anthropic-ai
Allow: /

User-agent: Diffbot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: Timesbot
Allow: /

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/site-knowledge.jsonld

Why explicit Allow directives? While User-agent: * technically permits all bots, explicitly naming each AI crawler signals intent and makes your policy unambiguous. It also prepares per-bot sections for path-specific restrictions later.

Template 2: Allow AI Search, Block AI Training

For publishers who want to appear in AI search results but prefer not to contribute to training data:

# === Standard Crawlers ===
User-agent: *
Allow: /

# === AI Search Crawlers (Allowed) ===
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# === AI Training Crawlers (Blocked) ===
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: anthropic-ai
Disallow: /

Sitemap: https://example.com/sitemap.xml

Trade-off: You retain search citation eligibility but lose GPTBot's Sitemap discovery benefits. A defensible middle ground for publishers with licensing concerns.

Template 3: Selective Path Blocking

For sites that want AI visibility for most content but need to protect specific sections (e.g., premium content, gated resources):

# === Standard Crawlers ===
User-agent: *
Allow: /

# === AI Crawlers: Allow site, block premium ===
User-agent: GPTBot
Allow: /
Disallow: /premium/
Disallow: /members-only/

User-agent: ClaudeBot
Allow: /
Disallow: /premium/
Disallow: /members-only/

User-agent: PerplexityBot
Allow: /
Disallow: /premium/
Disallow: /members-only/

User-agent: OAI-SearchBot
Allow: /
Disallow: /premium/
Disallow: /members-only/

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/site-knowledge.jsonld

Note: ChatGPT-User will still fetch premium paths during live conversations regardless. Server-level IP blocking is the only reliable method for that bot.
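Server-level blocking for a bot like this typically combines two signals: the user-agent string (trivially spoofable) and the operator's published IP ranges. A hedged Python sketch of that gate follows; the CIDR below is a documentation placeholder (TEST-NET-3), not a real OpenAI range, so always pull current ranges from the operator's own documentation, and remember that blocking an operator's full ranges also blocks its compliant crawlers.

```python
# Sketch of a server-level gate for bots that ignore robots.txt. The UA string
# can be spoofed, so pair it with an IP check against the operator's published
# ranges. The network below is a PLACEHOLDER (TEST-NET-3), not a real range.
import ipaddress

BLOCKED_UA_TOKENS = ("ChatGPT-User",)
BLOCKED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]  # placeholder

def should_block(user_agent: str, remote_ip: str) -> bool:
    """Return True when a request matches a blocked bot signature."""
    ua_match = any(t.lower() in user_agent.lower() for t in BLOCKED_UA_TOKENS)
    ip = ipaddress.ip_address(remote_ip)
    ip_match = any(ip in net for net in BLOCKED_NETWORKS)
    return ua_match or ip_match

print(should_block("Mozilla/5.0 (compatible; ChatGPT-User/1.0)", "198.51.100.7"))  # True
print(should_block("Mozilla/5.0 (ordinary browser)", "203.0.113.9"))               # True
print(should_block("Mozilla/5.0 (ordinary browser)", "198.51.100.7"))              # False
```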

📋 COMPLETE USER-AGENT TOKEN REFERENCE

Every AI-related user-agent token you should know about, organized by operator:

OpenAI

| Token | Purpose | First Seen |
|---|---|---|
| GPTBot | Training data crawler | 2023 |
| OAI-SearchBot | ChatGPT Search index | 2024 |
| ChatGPT-User | Live conversation fetch | 2023 |

Anthropic

| Token | Purpose | First Seen |
|---|---|---|
| ClaudeBot | Training data crawler | 2024 |
| Claude-User | Live conversation fetch | 2024 |
| anthropic-ai | Research crawler | 2023 |

Google

| Token | Purpose | First Seen |
|---|---|---|
| Google-Extended | AI training (Gemini) | 2023 |
| Googlebot | Google Search (incl. AI Mode) | Legacy |

Other Platforms

| Token | Operator | Purpose | First Seen |
|---|---|---|---|
| PerplexityBot | Perplexity | Index crawler | 2024 |
| Bytespider | ByteDance | Training crawler | 2023 |
| Applebot-Extended | Apple | AI training | 2024 |
| meta-externalagent | Meta | AI training | 2024 |
| FacebookBot | Meta | Content preview/AI | Legacy |
| Amazonbot | Amazon | Alexa/AI training | 2023 |
| cohere-ai | Cohere | Training crawler | 2024 |
| Diffbot | Diffbot | Data extraction | Legacy |
| Timesbot | Brave | Search AI features | 2024 |

For a live view of which bots are hitting your site right now, run a free AI visibility check.

⚡ CHATGPT-USER: THE BOT YOU CANNOT BLOCK WITH ROBOTS.TXT

This deserves its own section because it is the most consequential change in AI crawling behavior.

In December 2025, OpenAI updated its crawler documentation to formally remove any language suggesting ChatGPT-User would comply with robots.txt. Their stated rationale: ChatGPT-User is "a technical extension of the user" rather than an autonomous crawler. When someone asks ChatGPT to browse the web and it fetches your page, OpenAI considers that equivalent to a human clicking a link.

Our testing confirms this. ChatGPT-User fetched pages that were explicitly blocked by Disallow directives. No robots.txt request preceded the page fetch in server logs. It does not check.

How Claude handles the same scenario differently: Claude-User checks robots.txt once at the start of each session. If you have a Disallow for Claude-User, it returns a "no permission" error and logs zero server hits for the blocked path. The difference in approach is stark.

| Behavior | ChatGPT-User | Claude-User |
|---|---|---|
| Checks robots.txt | No | Yes (once per session) |
| Honors Disallow directives | No | Yes |
| Caching behavior | N/A | Session-level cache |
| Blocking method | Server-level IP block only | robots.txt works |
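The session-scoped pattern attributed to Claude-User is easy to picture in code. This is an illustrative sketch using Python's urllib.robotparser, not Anthropic's implementation: the rules are parsed once on the first fetch of a session and every later fetch reuses the cached verdict source.

```python
# Illustrative sketch of session-level robots.txt caching (the pattern the
# table attributes to Claude-User). Not Anthropic's actual implementation.
from urllib.robotparser import RobotFileParser

class SessionFetcher:
    """Hypothetical client: parse robots.txt once, reuse it for the session."""
    def __init__(self, user_agent: str, robots_lines: list[str]):
        self.user_agent = user_agent
        self._robots_lines = robots_lines
        self._rules = None  # populated on first fetch of the session

    def allowed(self, url: str) -> bool:
        if self._rules is None:  # first fetch: consult robots.txt, then cache
            self._rules = RobotFileParser()
            self._rules.parse(self._robots_lines)
        return self._rules.can_fetch(self.user_agent, url)

robots = ["User-agent: Claude-User", "Disallow: /private/"]
session = SessionFetcher("Claude-User", robots)
print(session.allowed("https://example.com/private/doc"))  # False
print(session.allowed("https://example.com/blog/post"))    # True
```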

The Bottom Line: If you need to control ChatGPT-User access, robots.txt will not help. Your only option is blocking OpenAI's IP ranges at the server level, which also blocks GPTBot and OAI-SearchBot. This is almost always counterproductive if you want AI visibility.

For the full technical breakdown, read OpenAI's Bots Have a Split Personality.

🛡️ PRACTICAL IMPLEMENTATION CHECKLIST

Use this checklist to audit and update your robots.txt for AI visibility:

  • Verify your robots.txt is accessible at yourdomain.com/robots.txt
  • Check for accidental blanket blocks (e.g., Disallow: / under User-agent: *)
  • Add explicit AI crawler rules using the templates above
  • Include Sitemap directives pointing to your XML sitemap and any structured data files
  • Remove outdated bot blocks (e.g., blocks for bots that no longer exist or have changed tokens)
  • Test with a fetch tool to confirm your robots.txt parses correctly
  • Monitor AI crawler activity to verify bots are actually hitting your site after changes
  • Set a review schedule because new AI bots appear regularly and token names change
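The monitoring step in the checklist can start as something very small: scan your access logs for known AI user-agent tokens and tally hits. A minimal Python sketch under stated assumptions follows; the log lines are invented examples in combined-log style, not real traffic, and the token list is a subset of the table above.

```python
# Sketch: tallying AI crawler hits from an access log so you can verify that
# robots.txt changes actually shifted bot behavior. Log lines are illustrative.
from collections import Counter

AI_BOT_TOKENS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
                 "Claude-User", "PerplexityBot", "Google-Extended", "Bytespider"]

log_lines = [
    '1.2.3.4 - - [24/Mar/2026] "GET /blog/post HTTP/1.1" 200 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"',
    '5.6.7.8 - - [24/Mar/2026] "GET /robots.txt HTTP/1.1" 200 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '9.9.9.9 - - [24/Mar/2026] "GET / HTTP/1.1" 200 "-" "Mozilla/5.0 (ordinary browser)"',
]

hits = Counter()
for line in log_lines:
    for token in AI_BOT_TOKENS:
        if token.lower() in line.lower():
            hits[token] += 1

print(dict(hits))  # {'GPTBot': 1, 'ClaudeBot': 1}
```

In production you would stream real log files (or middleware events) through the same loop and chart the counts per day.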

For ongoing monitoring, see our guide on how to track AI bots effectively.

🔗 HOW ROBOTS.TXT FITS INTO YOUR AI VISIBILITY STRATEGY

robots.txt is one piece of a larger optimization strategy. Here is how it connects to the rest:

1. robots.txt + Structured Data = AI Discovery. AI crawlers that follow your Sitemap directives will find your structured data (JSON-LD, schema markup). Research shows that schema markup presence is associated with 69% higher AI citation odds (Lee, 2026). robots.txt is the front door to that data.

2. robots.txt + Content Quality = Citation Eligibility. Allowing crawlers is necessary but not sufficient. The content itself needs to match query intent and include the page-level features that predict citation. Aggarwal et al. (2024) demonstrated that targeted optimization strategies can boost visibility in generative engine responses by up to 40%. For a complete optimization framework, see our Generative Engine Optimization guide.

3. robots.txt + Monitoring = Informed Decisions. Configuring robots.txt without monitoring is flying blind. You need to verify that bots are actually crawling, how often they return, and which paths they prioritize. Our AI Visibility service provides this monitoring layer.

❓ FREQUENTLY ASKED QUESTIONS

Should I block AI bots in robots.txt? For most businesses, no. Blocking AI crawlers removes you from AI search citations (Perplexity, ChatGPT Search) and prevents discovery of your structured data. The only scenario where blocking makes sense is if you have strong licensing or copyright concerns about AI training specifically, and even then, ChatGPT-User will still fetch your content during live conversations regardless. A selective approach (Template 2 above) is a more practical middle ground.

How quickly do AI bots pick up robots.txt changes? Most AI crawlers re-fetch robots.txt every 24 to 72 hours. GPTBot and ClaudeBot typically check daily. PerplexityBot can take up to a week. Claude-User caches robots.txt at the session level, so existing sessions may not reflect changes until the next session starts. There is no way to force an immediate refresh.

Does blocking Google-Extended affect my Google Search rankings? No. Google-Extended controls only whether your content is used for Gemini AI training. It has no effect on Googlebot or your Google Search rankings (including AI Overviews and AI Mode). Google-Extended is a separate token specifically for AI training opt-out.

Can I allow AI search bots but block AI training bots? Yes. Template 2 above does exactly this. Allow OAI-SearchBot (ChatGPT Search) and PerplexityBot (Perplexity answers) while blocking GPTBot (OpenAI training), ClaudeBot (Anthropic training), and Google-Extended (Gemini training). The trade-off is losing GPTBot's Sitemap discovery behavior.

What happens if my robots.txt has conflicting rules for an AI bot? Per RFC 9309, the most specific rule wins. If you have Allow: /blog/ and Disallow: / for the same user agent, the bot should access /blog/ paths but not other paths. However, implementation varies by bot. Test with server logs rather than assuming compliance. For tips on setting up effective monitoring, see how to track AI bots effectively.
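RFC 9309's precedence rule (longest matching path wins, with Allow winning ties) can be sketched directly, using the same Allow: /blog/ plus Disallow: / example from the answer above. This models the spec's intent, not any particular bot's behavior, which is why log-based verification still matters.

```python
# Sketch of RFC 9309 rule precedence: the longest matching path prefix wins,
# and Allow wins a length tie. Real crawler implementations vary.
def resolve(path: str, rules: list[tuple[str, str]]) -> bool:
    """rules: list of ("allow"|"disallow", path_prefix). True means allowed."""
    best = None  # (prefix_length, verdict)
    for verdict, prefix in rules:
        if path.startswith(prefix):
            length = len(prefix)
            if (best is None or length > best[0]
                    or (length == best[0] and verdict == "allow")):
                best = (length, verdict)
    return best is None or best[1] == "allow"  # no matching rule: allowed

rules = [("disallow", "/"), ("allow", "/blog/")]
print(resolve("/blog/ai-guide", rules))   # True: "/blog/" (6 chars) beats "/"
print(resolve("/premium/report", rules))  # False: only "/" matches
```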

📚 REFERENCES

  • Lee, A. (2026). "Query Intent, Not Google Rank: What Best Predicts AI Citation Behavior." Preprint v5. DOI
  • Aggarwal, P., Murahari, V., Rajpurohit, T., Kalyan, A., Narasimhan, K., & Deshpande, A. (2024). "GEO: Generative Engine Optimization." KDD 2024. DOI
  • Longpre, S., Mahari, R., Lee, A., Lund, C.P., & Oderinwale, H. (2024). "Consent in Crisis: The Rapid Decline of the AI Data Commons." Preprint. DOI
  • Cui, J., Zha, M., Wang, X., & Liao, X. (2025). "The Odyssey of robots.txt Governance: Measuring Convention Implications of Web Bots in Large Language Model Services." ACM WWW 2025. DOI
  • RFC 9309 (2022). "Robots Exclusion Protocol." IETF.