
AI SEO EXPERIMENTS

OpenAI's Bots Have a Split Personality: One Reads robots.txt, the Other Ignores It


OpenAI operates three separate crawler user agents. They don't all follow the same rules, and the inconsistency has real implications for publishers.

I run a site that tracks AI bot behavior with server-side Vercel middleware logs. Every request from an AI crawler gets logged with its user agent, the path it hit, and the timestamp. Over several months of monitoring and controlled testing, here's what I found about how OpenAI's bots interact with robots.txt.

GPTBot reads robots.txt - and actually follows references in it

GPTBot is OpenAI's training data crawler. It respects robots.txt. That part is well-documented.

What's less discussed is how thoroughly it parses the file. Our robots.txt includes this line:

Sitemap: https://aiplusautomation.com/site-knowledge.jsonld

That's a JSON-LD knowledge graph file - not a standard sitemap.xml. We referenced it through the Sitemap: directive in robots.txt to see if any bots would discover and follow it.

GPTBot did. Our server logs show GPTBot fetching site-knowledge.jsonld - a file that is only discoverable through robots.txt. It's not linked from any page on the site. No internal links point to it. The only way to find it is to read robots.txt, parse the Sitemap directive, and follow the URL.

This tells us GPTBot doesn't just check whether it's allowed or blocked. It actively reads robots.txt as a discovery mechanism, following Sitemap references to find content it might not encounter through normal crawling.
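You can reproduce this discovery path with Python's standard-library robots.txt parser, which exposes Sitemap directives directly (the robots.txt content below mirrors ours; the Disallow rule is added for illustration):

```python
# Sketch: robots.txt as a discovery mechanism. A compliant crawler can
# pull every Sitemap: URL out of the file, regardless of file type.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /private/

Sitemap: https://aiplusautomation.com/site-knowledge.jsonld
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# site_maps() returns the Sitemap: URLs - this is how a bot can find
# a JSON-LD file that no page on the site links to.
print(parser.site_maps())

# The same parse also answers the allow/block question:
print(parser.can_fetch("GPTBot", "https://aiplusautomation.com/private/x"))
```

Note that the parser doesn't care that the Sitemap target is JSON-LD rather than sitemap.xml; the directive is just a URL, which is what makes it usable as a generic discovery channel.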

ChatGPT-User ignores robots.txt entirely

ChatGPT-User is the user agent that fetches pages during live conversations. When a user asks ChatGPT about something and it decides to browse the web, this is the bot that hits your server.

In our controlled tests, ChatGPT-User fetched pages that were explicitly blocked by robots.txt Disallow directives. No robots.txt request preceded the page fetch in server logs. It doesn't check. It just fetches.

This was formalized in December 2025 when OpenAI updated its crawler documentation. Before that update, there was implicit language suggesting ChatGPT-User would comply with robots.txt. After the update, OpenAI removed all such language. Their rationale: ChatGPT-User is "a technical extension of the user" rather than an autonomous crawler, so crawler-control mechanisms don't apply.

The practical effect: if you add a Disallow directive for ChatGPT-User in your robots.txt, nothing happens. ChatGPT will still fetch your pages during conversations. The only way to actually block it is to block OpenAI's IP ranges at the server level - which requires maintaining an up-to-date blocklist and is counterproductive if you want AI visibility.
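If you do decide IP-level blocking is worth it, the matching logic is straightforward; the hard part is keeping the list current. A sketch, with placeholder CIDR ranges (not OpenAI's authoritative list, which you'd need to fetch and refresh from their published ranges):

```python
# Sketch of server-level IP blocking - the only mechanism that actually
# stops ChatGPT-User. The CIDR ranges below are illustrative placeholders;
# a real blocklist must be refreshed from OpenAI's published egress ranges.
import ipaddress

CHATGPT_USER_RANGES = [
    ipaddress.ip_network("23.98.142.176/28"),  # placeholder, not authoritative
    ipaddress.ip_network("40.84.180.224/28"),  # placeholder, not authoritative
]

def is_blocked(client_ip: str) -> bool:
    """True if the requesting IP falls inside a blocked CIDR range."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in CHATGPT_USER_RANGES)
```

In practice you'd run this check in middleware before serving content and return a 403 on a match; the same pattern works for any vendor that publishes its IP ranges.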

The stale URL problem ChatGPT-User exposes

ChatGPT does not maintain its own web index. Its conversational fetches are routed through Bing's search index. When a user asks a question, ChatGPT runs a Bing query, gets back a list of URLs, and ChatGPT-User fetches each one to construct the answer.

The side effect is observable in our server logs. ChatGPT-User regularly fetches URLs we deleted months earlier. The URLs sit in Bing's index until Bing's crawl cycle catches up, and ChatGPT cheerfully fetches whatever Bing tells it to. We see 404 responses to ChatGPT-User on slugs we removed in February still arriving in April.

The implication: even if you can't block ChatGPT-User in robots.txt, you can starve it indirectly. Get ranked in Bing for the queries you care about (Bing Webmaster Tools, sitemap submission, IndexNow pings) and ChatGPT-User will follow. Get de-indexed in Bing and ChatGPT-User loses its breadcrumbs to your site, regardless of what your robots.txt says.
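The IndexNow half of that is a single HTTP request. A sketch of building the ping URL (the endpoint is the public IndexNow API; the key is a placeholder, and a real key must also be hosted at a verification file on your domain):

```python
# Sketch of an IndexNow ping - one way to keep Bing's index fresh so
# ChatGPT-User follows current URLs instead of deleted ones.
# The key below is a placeholder; IndexNow requires the same key to be
# served from a text file on your site for verification.
from urllib.parse import urlencode

INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow"

def indexnow_ping_url(page_url: str, key: str) -> str:
    """Build the GET URL that notifies IndexNow of a new or changed page."""
    return f"{INDEXNOW_ENDPOINT}?{urlencode({'url': page_url, 'key': key})}"

# Submitting is then a plain GET, e.g. via urllib.request.urlopen(...).
print(indexnow_ping_url("https://aiplusautomation.com/blog/new-post",
                        "0123456789abcdef"))
```

The inverse also holds: removing stale URLs from Bing (via Bing Webmaster Tools' removal requests) is how you stop ChatGPT-User from re-fetching pages you deleted months ago.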

OAI-SearchBot: the third bot nobody talks about

There's also OAI-SearchBot, which indexes content specifically for ChatGPT Search results (the inline citations users see). It's separate from both GPTBot and ChatGPT-User, has its own robots.txt token, and does respect robots.txt.

So the full picture for OpenAI is:

| Bot | Purpose | Respects robots.txt |
| --- | --- | --- |
| GPTBot | Training data collection | Yes - and follows Sitemap references |
| OAI-SearchBot | Search index for ChatGPT citations | Yes |
| ChatGPT-User | Live page fetching during conversations | No |

You can block training (GPTBot) and search indexing (OAI-SearchBot) but cannot block conversational fetching (ChatGPT-User). The one bot you can't control is the one that actually reads your content in front of users.
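Expressed as a robots.txt file, the two blocks you do control look like this (a sketch; whether to Disallow everything or only specific paths is your call, and the third stanza is included only to show that it accomplishes nothing):

```
# Blocks training data collection - GPTBot honors this.
User-agent: GPTBot
Disallow: /

# Blocks ChatGPT Search indexing - OAI-SearchBot honors this.
User-agent: OAI-SearchBot
Disallow: /

# Has no effect: ChatGPT-User does not read robots.txt.
User-agent: ChatGPT-User
Disallow: /
```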

How other AI platforms handle this

The contrast with other platforms is striking.

Claude is the most compliant. Claude-User checks robots.txt once at the start of each session before its first fetch. If you have a Disallow directive for Claude-User, it returns a "no permission" error and logs zero server hits for the blocked path. We verified this with a controlled test: we deployed a Disallow directive on a test path, and Claude refused to fetch it. The control page on an allowed path was fetched successfully.

One quirk: Claude caches robots.txt at the session level. If you add a new Disallow directive mid-session, Claude won't see it until the next session. In our testing, the gap between the robots.txt request and the first page fetch was 3.0 seconds in a fresh session and 1.1 seconds in a subsequent one.

Perplexity doesn't have a conversational fetcher at all. PerplexityBot is a background crawler that builds their index, and it respects robots.txt. All Perplexity answers are served from this pre-built index, never from live fetches.

Gemini doesn't send any identifiable AI-specific fetch requests during conversations. We monitored for 14 days and saw zero Gemini-specific hits. It grounds answers through Google's internal search infrastructure using regular Googlebot-crawled content.

Summary:

| Platform | Conversational bot | Respects robots.txt |
| --- | --- | --- |
| ChatGPT | ChatGPT-User | No |
| Claude | Claude-User | Yes (session-cached) |
| Perplexity | None (index only) | N/A |
| Gemini | None (Google Search) | N/A |

The JavaScript blind spot all three OpenAI bots share

There is one more constraint that matters for publishers and is often missed. None of OpenAI's three bots execute JavaScript. They fetch JS files (Vercel and MERJ found GPTBot pulling JS files in 11.5% of requests across more than 500 million observed fetches), but they do not run them.

That means content rendered client-side after the page loads, common on React, Vue, or Angular sites without server-side rendering, is invisible to all three bots. GPTBot will not include client-rendered text in training data. OAI-SearchBot will not surface it in ChatGPT Search citations. ChatGPT-User will not retrieve it during a conversation. The only content that exists for OpenAI is the content present in the initial HTML response.

The fix is server-side rendering or static generation. Frameworks like Next.js, Nuxt, Astro, or plain HTML solve this by default. We have a longer breakdown in Server Side Rendering for AI Platforms.
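A quick way to audit this yourself is to look at the raw HTML the way a non-rendering bot does: fetch the page without a browser and check whether your key content is already in it. A crude sketch (real crawlers parse HTML properly; substring matching on extracted text is only a rough proxy, and the two sample pages are invented for illustration):

```python
# Sketch of the "what does a non-JS bot see" check: inspect the initial
# HTML only, ignoring anything JavaScript would render later.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

def visible_to_bots(raw_html: str, phrase: str) -> bool:
    """True if the phrase exists in the initial HTML, pre-JavaScript."""
    p = TextExtractor()
    p.feed(raw_html)
    return phrase in "".join(p.chunks)

# Server-rendered page: content is in the initial HTML.
ssr_page = "<html><body><h1>Pricing</h1></body></html>"
# Client-rendered page: content only appears after JS runs.
csr_page = ("<html><body><div id='root'></div>"
            "<script>render('Pricing')</script></body></html>")
```

In production you'd feed this the response body from a plain HTTP GET (no browser) for each page you care about; anything the check misses is invisible to all three OpenAI bots.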

A note on bot diversity in practice

Across the five sites we monitor with BotSight, we observe between four and eleven distinct AI bots in any given month. OpenAI's three bots (GPTBot, OAI-SearchBot, ChatGPT-User) plus Anthropic's three or four (ClaudeBot, Claude-Web, Claude-User, sometimes Claude-SearchBot) plus PerplexityBot, plus Applebot, plus Google-Extended typically gets us to eight to ten unique bots on a healthy site.

A few bots that get announced but do not show up in production yet: xAI's Grok variants (training on X content rather than broad web), a separate Microsoft Copilot retrieval bot (Copilot still uses Bing's infrastructure), Meta AI's "Fetcher" beyond meta-externalagent, and a Gemini-specific research crawler distinct from Google-Extended. If any of these start appearing in your logs in volume, the AI crawler landscape just changed.

What this means in practice

If you're a publisher thinking about AI bot access, the situation is nuanced:

  1. robots.txt gives you partial control. You can block training data collection (GPTBot, ClaudeBot) and search indexing (OAI-SearchBot, PerplexityBot). But you cannot block ChatGPT's live conversational access through robots.txt. Only Claude respects Disallow directives during conversations.

  2. robots.txt is also a discovery tool. GPTBot following our Sitemap reference to a non-linked JSON-LD file suggests that robots.txt is worth treating as more than just an access control file. If you have structured data or knowledge graphs you want AI training crawlers to find, referencing them in robots.txt Sitemap directives is a viable discovery path.

  3. "Blocking AI" is mostly theater. If your goal is to prevent any AI from reading your content, robots.txt won't do it. ChatGPT-User ignores it, and even if you block the training crawlers, your content likely already exists in training data from before you added the block. The more practical question is whether you want to optimize for AI visibility or accept whatever level of access the platforms take by default.

  4. The opt-out framing is backwards. OpenAI's framing of ChatGPT-User as a "user extension" rather than a crawler shifts the consent model. Traditional robots.txt assumes you're dealing with automated crawlers. OpenAI is saying: this is a person browsing through our tool, so your crawler rules don't apply. Whether you agree with that framing is a policy question, not a technical one.

Limitations

  • Our testing is on a single site (aiplusautomation.com on Vercel). Behavior may differ on other hosting platforms or for sites with different characteristics.
  • The GPTBot/site-knowledge.jsonld observation is from production logs, not a controlled A/B test. We can confirm GPTBot fetched the file and that the file is only referenced in robots.txt, but we can't rule out other discovery mechanisms we didn't account for.
  • ChatGPT-User non-compliance was tested with a limited number of blocked paths. We didn't test every edge case.
  • Platform behavior changes. OpenAI could start respecting robots.txt for ChatGPT-User tomorrow (or any other platform could stop).
