OpenAI operates three separate crawler user agents. They don't all follow the same rules, and the inconsistency has real implications for publishers.
We run a site that tracks AI bot behavior using server-side Vercel middleware. Every request from an AI crawler gets logged with its user agent, the path it hit, and the timestamp. Over several months of monitoring and controlled testing, here's what we found about how OpenAI's bots interact with robots.txt.
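The detection side of that setup is simple in principle: match known AI bot tokens against the User-Agent string of each request. A minimal sketch in Python against combined-log-style access lines (our actual middleware logs structured JSON; the regex and the bot list here are illustrative, not our production code):

```python
import re

# User-agent tokens published by the AI platforms discussed below.
AI_BOTS = ("GPTBot", "OAI-SearchBot", "ChatGPT-User", "Claude-User", "PerplexityBot")

# Matches the request and user-agent fields of a combined-log-format line.
LOG_RE = re.compile(r'"(?P<method>\w+) (?P<path>\S+) [^"]*" \d+ \d+ "[^"]*" "(?P<ua>[^"]*)"')

def ai_bot(log_line: str):
    """Return the first matching AI bot token in the user agent, or None."""
    m = LOG_RE.search(log_line)
    if not m:
        return None
    ua = m.group("ua")
    return next((bot for bot in AI_BOTS if bot in ua), None)
```

Substring matching on the token is enough here because each platform's documented user agents embed a stable product name; a stricter check would pin the full published UA strings.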
GPTBot reads robots.txt - and actually follows references in it
GPTBot is OpenAI's training data crawler. It respects robots.txt. That part is well-documented.
What's less discussed is how thoroughly it parses the file. Our robots.txt includes this line:
```
Sitemap: https://aiplusautomation.com/site-knowledge.jsonld
```
That's a JSON-LD knowledge graph file - not a standard sitemap.xml. We referenced it through the Sitemap: directive in robots.txt to see if any bots would discover and follow it.
GPTBot did. Our server logs show GPTBot fetching site-knowledge.jsonld - a file that is only discoverable through robots.txt. It's not linked from any page on the site. No internal links point to it. The only way to find it is to read robots.txt, parse the Sitemap directive, and follow the URL.
This tells us GPTBot doesn't just check whether it's allowed or blocked. It actively reads robots.txt as a discovery mechanism, following Sitemap references to find content it might not encounter through normal crawling.
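Any compliant parser can surface that directive without fetching a single page. A minimal sketch with Python's standard-library `urllib.robotparser` (the domain is ours; the `Disallow` path is illustrative, not from our actual file):

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /private/

Sitemap: https://aiplusautomation.com/site-knowledge.jsonld
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# The Sitemap reference is discoverable from robots.txt alone --
# no page on the site needs to link to it.
sitemaps = rp.site_maps()

# The same parse answers the allow/block question.
blocked = rp.can_fetch("GPTBot", "https://aiplusautomation.com/private/page")
```

This is the behavior our logs attribute to GPTBot: one robots.txt fetch yields both the access rules and a list of URLs to crawl next.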
ChatGPT-User ignores robots.txt entirely
ChatGPT-User is the user agent that fetches pages during live conversations. When a user asks ChatGPT about something and it decides to browse the web, this is the bot that hits your server.
In our controlled tests, ChatGPT-User fetched pages that were explicitly blocked by robots.txt Disallow directives. No robots.txt request preceded the page fetch in server logs. It doesn't check. It just fetches.
This was formalized in December 2025 when OpenAI updated its crawler documentation. Before that update, there was implicit language suggesting ChatGPT-User would comply with robots.txt. After the update, OpenAI removed all such language. Their rationale: ChatGPT-User is "a technical extension of the user" rather than an autonomous crawler, so crawler-control mechanisms don't apply.
The practical effect: if you add a Disallow directive for ChatGPT-User in your robots.txt, nothing happens. ChatGPT will still fetch your pages during conversations. The only way to actually block it is to block OpenAI's IP ranges at the server level - which requires maintaining an up-to-date blocklist and is counterproductive if you want AI visibility.
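If you do go the server-level route, the check itself is straightforward. A minimal sketch using Python's `ipaddress` module; the CIDR ranges here are RFC 5737 documentation placeholders, not OpenAI's real egress ranges (OpenAI publishes those separately, and they change, which is the maintenance burden):

```python
import ipaddress

# Placeholder blocklist -- substitute the platform's published egress ranges.
BLOCKED_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("203.0.113.0/24", "198.51.100.0/24")  # RFC 5737 examples
]

def is_blocked(client_ip: str) -> bool:
    """True if the client IP falls inside any blocked range."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in BLOCKED_RANGES)
```

In a middleware, `is_blocked` would run before the page handler and return a 403 on a match; the blocklist needs refreshing whenever the published ranges change.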
OAI-SearchBot: the third bot nobody talks about
There's also OAI-SearchBot, which indexes content specifically for ChatGPT Search results (the inline citations users see). It's separate from both GPTBot and ChatGPT-User, has its own robots.txt token, and does respect robots.txt.
So the full picture for OpenAI is:
| Bot | Purpose | Respects robots.txt |
|---|---|---|
| GPTBot | Training data collection | Yes - and follows Sitemap references |
| OAI-SearchBot | Search index for ChatGPT citations | Yes |
| ChatGPT-User | Live page fetching during conversations | No |
You can block training (GPTBot) and search indexing (OAI-SearchBot) but cannot block conversational fetching (ChatGPT-User). The one bot you can't control is the one that actually reads your content in front of users.
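For publishers who want the partial control anyway, the directives look like this. Note that the ChatGPT-User stanza is included only to show what does not work; per the above, it has no effect:

```
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

# No effect in practice -- ChatGPT-User does not read this file
User-agent: ChatGPT-User
Disallow: /
```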
How other AI platforms handle this
The contrast with other platforms is striking.
Claude is the most compliant. Claude-User checks robots.txt once at the start of each session before its first fetch. If you have a Disallow directive for Claude-User, it returns a "no permission" error and logs zero server hits for the blocked path. We verified this with a controlled test: we deployed a Disallow directive on a test path, and Claude refused to fetch it. The control page on an allowed path was fetched successfully.
One quirk: Claude caches robots.txt at the session level. If you add a new Disallow directive mid-session, Claude won't see it until the next session. In our testing, the gap between the robots.txt request and the first page fetch was 3.0 seconds in a fresh session and 1.1 seconds in a subsequent one.
Perplexity doesn't have a conversational fetcher at all. PerplexityBot is a background crawler that builds their index, and it respects robots.txt. All Perplexity answers are served from this pre-built index, never from live fetches.
Gemini doesn't send any identifiable AI-specific fetch requests during conversations. We monitored for 14 days and saw zero Gemini-specific hits. It grounds answers through Google's internal search infrastructure using regular Googlebot-crawled content.
Summary:
| Platform | Conversational bot | Respects robots.txt |
|---|---|---|
| ChatGPT | ChatGPT-User | No |
| Claude | Claude-User | Yes (session-cached) |
| Perplexity | None (index only) | N/A |
| Gemini | None (Google Search) | N/A |
What this means in practice
If you're a publisher thinking about AI bot access, the situation is nuanced:
robots.txt gives you partial control. You can block training data collection (GPTBot, ClaudeBot) and search indexing (OAI-SearchBot, PerplexityBot). But you cannot block ChatGPT's live conversational access through robots.txt. Only Claude respects Disallow directives during conversations.
robots.txt is also a discovery tool. GPTBot following our Sitemap reference to a non-linked JSON-LD file suggests that robots.txt is worth treating as more than just an access control file. If you have structured data or knowledge graphs you want AI training crawlers to find, referencing them in robots.txt Sitemap directives is a viable discovery path.
"Blocking AI" is mostly theater. If your goal is to prevent any AI from reading your content, robots.txt won't do it. ChatGPT-User ignores it, and even if you block the training crawlers, your content likely already exists in training data from before you added the block. The more practical question is whether you want to optimize for AI visibility or accept whatever level of access the platforms take by default.
The opt-out framing is backwards. OpenAI's framing of ChatGPT-User as a "user extension" rather than a crawler shifts the consent model. Traditional robots.txt assumes you're dealing with automated crawlers. OpenAI is saying: this is a person browsing through our tool, so your crawler rules don't apply. Whether you agree with that framing is a policy question, not a technical one.
Limitations
- Our testing is on a single site (aiplusautomation.com on Vercel). Behavior may differ on other hosting platforms or for sites with different characteristics.
- The GPTBot/site-knowledge.jsonld observation is from production logs, not a controlled A/B test. We can confirm GPTBot fetched the file and that the file is only referenced in robots.txt, but we can't rule out other discovery mechanisms we didn't account for.
- ChatGPT-User non-compliance was tested with a limited number of blocked paths. We didn't test every edge case.
- Platform behavior changes. OpenAI could start respecting robots.txt for ChatGPT-User tomorrow (or any other platform could stop).
References
- OpenAI crawler documentation: GPTBot, OAI-SearchBot, ChatGPT-User
- December 2025 policy change analysis: Scalevise, PPC Land
- Our AI citation research: Query Intent, Not Google Rank (DOI: 10.5281/zenodo.18653093)