Your site is being crawled by more than 15 AI bots right now. Google Analytics cannot see any of them. This guide covers every user agent string, every robots.txt rule, and every monitoring method you need to take control of AI crawl access in 2026.
AI crawlers do not run JavaScript. They send HTTP requests, download raw HTML, and leave. Google Analytics 4, Plausible, Fathom, and every other client-side analytics tool will never register a single AI bot visit. You could have GPTBot hitting your site 10,000 times a day and your analytics dashboard would show zero.
This is not a minor gap. Lee (2026a) found that AI platforms like ChatGPT and Claude perform live page fetches during search queries, while Perplexity and Gemini rely on pre-built indices from regular crawling. Understanding crawl behavior is the first step to controlling your AI search visibility. You cannot optimize what you cannot measure.
This guide consolidates everything you need: the full user agent reference table, robots.txt compliance data, copy-paste configuration templates, monitoring method comparisons, and the traffic patterns that tell you where your content stands.
🚫 WHY GA4 MISSES AI BOT TRAFFIC
GA4 relies on a JavaScript snippet (gtag.js) that runs in the visitor's browser. When a human visits your page, their browser executes the script, which sends a tracking event to Google's servers.
AI crawlers skip all of that. They request the page, parse the HTML, and move on. No browser. No JavaScript execution. No analytics event. This applies to every AI crawler currently operating, from GPTBot to ClaudeBot to PerplexityBot and beyond.
The gap is structural, not configurable. No amount of GA4 settings will fix it because the bots never trigger the tracking code in the first place. The same limitation applies to Adobe Analytics, Mixpanel, and every other tool that depends on client-side execution.
The Bottom Line: If you rely on JavaScript-based analytics to understand AI bot activity, you are flying blind. Server-side tracking is the only way to see what is actually happening.
🤖 THE COMPLETE AI BOT USER AGENTS TABLE (2026)
There are now more than 15 distinct AI bot user agents crawling the web, each with different strings, different purposes, and different robots.txt compliance behaviors. A single typo in your robots.txt means the rule does nothing. Here is every bot you need to know, organized by operator.
OpenAI Bots
OpenAI runs three separate crawlers. This is the most important distinction in AI crawl management.
| Field | GPTBot | OAI-SearchBot | ChatGPT-User |
|---|---|---|---|
| User Agent Token | GPTBot | OAI-SearchBot | ChatGPT-User |
| Full UA String | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot) | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot) | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ChatGPT-User/1.0; +https://openai.com/bot) |
| Purpose | Training data collection | Web search for AI-generated results | Real-time page fetching during conversations |
| Robots.txt Compliance | Yes | Yes | No (reclassified Dec 2025) |
| Crawl Pattern | Broad, high-volume. Blocking this does NOT block ChatGPT search citations. | Powers search results in ChatGPT's search feature. | Triggered by user queries in real time. Lower volume but targeted. |
You can block GPTBot (training) while allowing OAI-SearchBot and ChatGPT-User (citations). This is the most common configuration for publishers who want AI citation visibility without contributing training data. For a deeper analysis, see OpenAI's Bots Have a Split Personality.
Anthropic Bots
Anthropic mirrors OpenAI's three-tier approach. The key behavioral difference is session caching: ClaudeBot caches page content within a conversation session rather than re-fetching on every reference.
| Field | ClaudeBot | Claude-User | Claude-SearchBot |
|---|---|---|---|
| User Agent Token | ClaudeBot | Claude-User | Claude-SearchBot |
| Full UA String | Mozilla/5.0 (compatible; ClaudeBot/1.0; +https://www.anthropic.com/crawlers) | Mozilla/5.0 (compatible; Claude-User/1.0; +https://www.anthropic.com/crawlers) | Mozilla/5.0 (compatible; Claude-SearchBot/1.0; +https://www.anthropic.com/crawlers) |
| Purpose | Training data and index building | Real-time page fetching during conversations | Search result retrieval for grounded responses |
| Robots.txt Compliance | Yes | Yes (session-cached) | Yes |
| Crawl Pattern | Session-cached crawling. Less aggressive than GPTBot. | Low volume, highly targeted. | Similar to OAI-SearchBot. |
Claude-User checks robots.txt once at the start of each session and caches the result. If you block Claude-User, it returns a "no permission" error and logs zero server hits for the blocked path. For more on Claude's web fetching architecture, see Claude Web Fetch Explained.
Perplexity Bots
| Field | PerplexityBot | Perplexity-User |
|---|---|---|
| User Agent Token | PerplexityBot | Perplexity-User |
| Full UA String | Mozilla/5.0 (compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot) | Mozilla/5.0 (compatible; Perplexity-User/1.0; +https://perplexity.ai/perplexitybot) |
| Purpose | Index building and broad web crawling | Real-time page fetching for user browsing |
| Robots.txt Compliance | Yes (after 2024 compliance controversy) | Yes |
| Crawl Pattern | Aggressive. Re-crawl frequency varies by page authority. | Activated when users share specific URLs. |
Perplexity operates on a pre-built index model. Unlike ChatGPT and Claude, which fetch pages live during queries, Perplexity retrieves results from its index. Crawl frequency directly affects citation freshness. If PerplexityBot has not re-crawled your page recently, Perplexity may cite outdated content or miss it entirely.
Google AI Bots
| Field | Google-Extended | GoogleOther |
|---|---|---|
| User Agent Token | Google-Extended | GoogleOther |
| Full UA String | Mozilla/5.0 (compatible; Google-Extended) | Mozilla/5.0 (compatible; GoogleOther) |
| Purpose | Controls AI training/grounding for Gemini and AI Mode | General-purpose fetcher for non-search products |
| Robots.txt Compliance | Yes | Yes |
| Crawl Pattern | Does NOT affect regular Google Search indexing. | Lower priority for AI management. |
Google's approach differs from OpenAI and Anthropic. Instead of separate crawlers for training versus search, Google uses Google-Extended to govern whether your already-indexed content feeds into Gemini and AI Mode. Blocking Google-Extended does not remove you from Google Search. It only prevents AI feature usage.
Other Major AI Bots
| Bot | User Agent Token | Full UA String | Operator | Purpose | Robots.txt | Notes |
|---|---|---|---|---|---|---|
| Bytespider | Bytespider | Mozilla/5.0 (compatible; Bytespider; spider-feedback@bytedance.com) | ByteDance | Training for TikTok search AI | Yes | Very aggressive. Many publishers block due to volume. |
| Meta-ExternalAgent | Meta-ExternalAgent | Mozilla/5.0 (compatible; Meta-ExternalAgent/1.0; +https://www.meta.com/en-us/help/1417427585694498/) | Meta | Training for Meta AI across Facebook, Instagram, WhatsApp | Yes | Distinct from FacebookBot (link previews). |
| FacebookBot | FacebookBot | Mozilla/5.0 (compatible; FacebookBot/1.0; ...) | Meta | Link preview rendering and Open Graph tags | Yes | NOT an AI training bot. Blocking breaks social sharing previews. |
| AmazonBot | Amazonbot | Mozilla/5.0 (compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) | Amazon | Alexa, Amazon Q, Amazon search | Yes | Moderate volume. Respects crawl-delay. |
| AppleBot | Applebot | Mozilla/5.0 (Applebot/0.1; +http://www.apple.com/go/applebot) | Apple | Siri, Apple Intelligence, Spotlight, Safari | Yes | Also uses Applebot-Extended for AI-specific training. |
| DuckAssistBot | DuckAssistBot | Mozilla/5.0 (compatible; DuckAssistBot/1.1; +https://duckduckgo.com/duckassistbot) | DuckDuckGo | AI-generated answers in DuckDuckGo | Yes | Lower volume. Focused on factual content. |
| Bingbot | bingbot | Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) | Microsoft | Web indexing for Bing Search | Yes | Critical: ChatGPT uses Bing's index for URL discovery. |
| Diffbot | Diffbot | Various | Diffbot | Web data extraction for downstream AI services | Yes | Used by many AI pipelines. |
| Timesbot | Timesbot | Various | Brave | Brave Search AI features | Yes | Lower volume. |
| cohere-ai | cohere-ai | Various | Cohere | Model training | Yes | Lower volume crawler. |
The Bottom Line: Bingbot deserves special attention even though it is a traditional search crawler. Lee (2026a) found that ChatGPT URL selection correlates with Bing indexing. Block Bingbot and you reduce your chances of appearing in ChatGPT search results, even if you allow all three OpenAI bots.
📊 TRAINING BOTS VS. SEARCH BOTS: THE KEY DISTINCTION
The most important management decision is distinguishing between bots that crawl for model training and bots that crawl to serve real-time search citations.
| Bot | Training | Search/Citation | Recommended Action |
|---|---|---|---|
| GPTBot | Yes | No | Block if you do not want content used for training |
| OAI-SearchBot | No | Yes | Allow for ChatGPT search visibility |
| ChatGPT-User | No | Yes | Allow (cannot be blocked via robots.txt anyway) |
| ClaudeBot | Yes | Partial (session cache) | Allow for Claude visibility; block only if training is a concern |
| Claude-User | No | Yes | Allow for live citation |
| Claude-SearchBot | No | Yes | Allow for Claude search visibility |
| PerplexityBot | Yes | Yes (index-based) | Allow for Perplexity visibility; dual-purpose bot |
| Perplexity-User | No | Yes | Allow for live browsing citations |
| Google-Extended | Yes (AI features) | Yes (AI Mode) | Allow for Google AI Mode visibility |
| Bytespider | Yes | Minimal | Block if volume is a problem or if you do not target TikTok |
| Meta-ExternalAgent | Yes | Emerging | Allow if you want Meta AI visibility |
| AmazonBot | Yes | Yes (Alexa, Amazon Q) | Allow for Amazon ecosystem visibility |
| Applebot | Yes | Yes (Siri, Apple Intelligence) | Allow for Apple ecosystem visibility |
| DuckAssistBot | No | Yes | Allow for DuckDuckGo AI answer visibility |
| bingbot | No (indexing) | Indirect (powers ChatGPT) | Always allow |
The minimum viable "allow" list for AI search visibility: OAI-SearchBot, ChatGPT-User, ClaudeBot, Claude-User, Claude-SearchBot, PerplexityBot, Google-Extended, Applebot, and bingbot. Block any of these and you lose visibility on the corresponding platform.
🛡️ ROBOTS.TXT COMPLIANCE: WHO RESPECTS IT AND WHO IGNORES IT
We tested every major AI crawler against robots.txt directives using server-side Vercel middleware logs. Every request was logged with its user agent, path, and timestamp. Of 15+ tracked AI bots, only one ignores robots.txt outright: ChatGPT-User.
| Bot (User-Agent Token) | Operator | Respects robots.txt | Notes |
|---|---|---|---|
| GPTBot | OpenAI | Yes | Also follows Sitemap directives to discover content |
| OAI-SearchBot | OpenAI | Yes | Separate token from GPTBot |
| ChatGPT-User | OpenAI | No | Reclassified as "user extension" Dec 2025 |
| ClaudeBot | Anthropic | Yes | Standard crawler compliance |
| Claude-User | Anthropic | Yes (session-cached) | Checks robots.txt once per session, caches result |
| Claude-SearchBot | Anthropic | Yes | Standard compliance |
| PerplexityBot | Perplexity | Yes | Reliable since late 2024 compliance fix |
| Google-Extended | Google | Yes | Separate from Googlebot; AI training only |
| Bytespider | ByteDance | Yes | Often blocked due to aggressive crawl rates |
| Applebot-Extended | Apple | Yes | Separate from Applebot (Siri/Spotlight) |
| meta-externalagent | Meta | Yes | Relatively new token (2024) |
| Amazonbot | Amazon | Yes | Respects crawl-delay |
| cohere-ai | Cohere | Yes | Lower volume |
| Diffbot | Diffbot | Yes | Used by many downstream AI services |
| DuckAssistBot | DuckDuckGo | Yes | Lower volume |
Cui et al. (2025) conducted a systematic analysis of 18 LLM bots across 582,281 robots.txt files and confirmed a significant increase in AI-specific rules, particularly in finance and news categories. Their research also identified multiple cases of robots.txt violations, including instances where ChatGPT-User accessed restricted content.
The ChatGPT-User Exception
In December 2025, OpenAI updated its crawler documentation to formally remove any language suggesting ChatGPT-User would comply with robots.txt. Their rationale: ChatGPT-User is "a technical extension of the user" rather than an autonomous crawler. When someone asks ChatGPT to browse the web and it fetches your page, OpenAI considers that equivalent to a human clicking a link.
Testing confirms this. ChatGPT-User fetched pages that were explicitly blocked by Disallow directives. No robots.txt request preceded the page fetch in server logs.
Claude handles the same scenario differently. Claude-User checks robots.txt before its first fetch in each session and returns a "no permission" error if blocked. The difference in approach is stark:
| Behavior | ChatGPT-User | Claude-User |
|---|---|---|
| Checks robots.txt | No | Yes (once per session) |
| Honors Disallow directives | No | Yes |
| Blocking method | Server-level IP block only | robots.txt works |
If you need to control ChatGPT-User access, robots.txt will not help. Your only option is blocking OpenAI's IP ranges at the server level, which also blocks GPTBot and OAI-SearchBot. That is almost always counterproductive if you want AI visibility.
The Bottom Line: robots.txt is a polite request, not a firewall. Compliant bots honor it. ChatGPT-User does not. And blocking training bots retroactively does not erase content already in their training data. For the full technical breakdown, read OpenAI's Bots Have a Split Personality.
📝 ROBOTS.TXT TEMPLATES (COPY-PASTE READY)
Based on controlled testing and published research, here are three templates for the most common scenarios.
Template 1: Maximum AI Visibility (Recommended)
```
# === Standard Crawlers ===
User-agent: *
Allow: /

# === AI Training Crawlers ===
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Bytespider
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: meta-externalagent
Allow: /

User-agent: Amazonbot
Allow: /

User-agent: cohere-ai
Allow: /

User-agent: anthropic-ai
Allow: /

User-agent: Diffbot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: Timesbot
Allow: /

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/site-knowledge.jsonld
```
Why explicit Allow directives? While User-agent: * technically permits all bots, explicitly naming each AI crawler signals intent and prepares per-bot sections for path-specific restrictions later.
Template 2: Allow AI Search, Block AI Training
For publishers who want AI search citations but prefer not to contribute to training data:
```
# === Standard Crawlers ===
User-agent: *
Allow: /

# === AI Search Crawlers (Allowed) ===
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# === AI Training Crawlers (Blocked) ===
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: anthropic-ai
Disallow: /

Sitemap: https://example.com/sitemap.xml
```
Trade-off: You retain search citation eligibility but lose GPTBot's Sitemap discovery benefits.
Template 3: Selective Path Blocking
For sites that want AI visibility for most content but need to protect specific sections:
```
# === Standard Crawlers ===
User-agent: *
Allow: /

# === AI Crawlers: Allow site, block premium ===
User-agent: GPTBot
Allow: /
Disallow: /premium/
Disallow: /members-only/

User-agent: ClaudeBot
Allow: /
Disallow: /premium/
Disallow: /members-only/

User-agent: PerplexityBot
Allow: /
Disallow: /premium/
Disallow: /members-only/

User-agent: OAI-SearchBot
Allow: /
Disallow: /premium/
Disallow: /members-only/

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/site-knowledge.jsonld
```
Note: ChatGPT-User will still fetch premium paths during live conversations regardless. Server-level IP blocking is the only reliable method for that bot.
🔍 ROBOTS.TXT AS A DISCOVERY MECHANISM
Here is what most guides miss: robots.txt is not just about blocking. It is one of the most reliable ways to help AI crawlers find your content.
Our robots.txt includes a Sitemap directive pointing to a JSON-LD knowledge graph file. That file is not linked from any page on the site. The only way to find it is to read robots.txt, parse the Sitemap directive, and follow the URL.
GPTBot did exactly that. Server logs show GPTBot fetching site-knowledge.jsonld, a file discoverable only through robots.txt. This confirms that GPTBot treats robots.txt as a discovery tool, not just a permission file.
Lee (2026a) found that schema markup presence is associated with 69% higher AI citation odds. robots.txt is the front door to that structured data.
Sitemap best practices for AI crawlers:
- Include your XML sitemap
- Reference structured data files (JSON-LD knowledge graphs)
- Keep sitemaps fresh with accurate lastmod timestamps
- Use multiple Sitemap directives if you have separate data for different purposes
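To audit what crawlers can discover through your robots.txt, you can extract the Sitemap directives yourself. A small sketch; the function name and sample body are illustrative:

```python
def extract_sitemaps(robots_txt: str) -> list[str]:
    """Pull every Sitemap directive out of a robots.txt body.
    Sitemap lines sit outside User-agent groups and apply globally."""
    sitemaps = []
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            sitemaps.append(value.strip())
    return sitemaps

robots = """User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/site-knowledge.jsonld
"""
print(extract_sitemaps(robots))
# → ['https://example.com/sitemap.xml', 'https://example.com/site-knowledge.jsonld']
```

This is the same parse a crawler like GPTBot performs before following each URL, which is why an unlinked file can still be discovered.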
📈 MONITORING METHODS COMPARED
Once you understand which bots exist and how to configure access, you need to actually track what they are doing on your site. There are three primary approaches, each with different tradeoffs.
Method 1: Server Log Analysis
Server logs are the ground truth. Every web server records every HTTP request, including the user-agent string. A typical AI crawler entry in your Nginx access log looks like this:
```
66.249.73.42 - - [24/Mar/2026:10:15:32 +0000] "GET /blog/example-post HTTP/1.1" 200 45230 "-" "Mozilla/5.0 AppleWebKit/537.36 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"
```
Basic parsing commands:
```bash
# Count visits by each AI bot
grep -E "GPTBot|ClaudeBot|PerplexityBot|Google-Extended|ChatGPT-User|OAI-SearchBot" \
  /var/log/nginx/access.log | grep -oP 'compatible; \K[^/;)]+' | sort | uniq -c | sort -rn

# See which pages GPTBot is crawling most
grep "GPTBot" /var/log/nginx/access.log | awk '{print $7}' | sort | uniq -c | sort -rn | head -20
```
Server log analysis is free, requires no third-party tools, and captures every single request. But it requires SSH access, manual parsing, and does not scale across multiple servers.
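If you want structured output rather than shell pipelines, the same counting can be sketched in Python. The combined log format and abridged bot list are assumptions carried over from the examples above:

```python
from collections import Counter

AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot",
           "Google-Extended", "ChatGPT-User", "OAI-SearchBot"]

def count_bot_hits(log_lines):
    """Count requests per AI bot from combined-format access log lines."""
    hits = Counter()
    for line in log_lines:
        for bot in AI_BOTS:
            if bot in line:
                hits[bot] += 1
                break  # attribute each request line to one bot
    return hits

sample = [
    '66.249.73.42 - - [24/Mar/2026:10:15:32 +0000] "GET /blog/a HTTP/1.1" 200 45230 "-" '
    '"Mozilla/5.0 AppleWebKit/537.36 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
    '1.2.3.4 - - [24/Mar/2026:10:16:01 +0000] "GET /blog/b HTTP/1.1" 200 1024 "-" '
    '"Mozilla/5.0 (compatible; ClaudeBot/1.0; +https://www.anthropic.com/crawlers)"',
]
print(count_bot_hits(sample))  # → Counter({'GPTBot': 1, 'ClaudeBot': 1})
```

Feeding it `open("/var/log/nginx/access.log")` gives the same tally as the grep pipeline, in a form you can store or graph.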
Method 2: Edge Middleware (Vercel, Cloudflare, Netlify)
Edge functions execute at the CDN level before a request reaches your origin server, meaning they see every request including AI crawlers.
Cloudflare Workers example:
```javascript
export default {
  async fetch(request, env) {
    const ua = request.headers.get("user-agent") || "";
    const aiPatterns = ["GPTBot", "OAI-SearchBot", "ChatGPT-User",
                        "ClaudeBot", "PerplexityBot", "Google-Extended", "Bytespider"];
    const matchedBot = aiPatterns.find(bot => ua.includes(bot));
    if (matchedBot) {
      await env.BOT_LOGS.put(`${Date.now()}-${matchedBot}`,
        JSON.stringify({ bot: matchedBot, url: request.url,
                         timestamp: new Date().toISOString() }));
    }
    return fetch(request);
  }
};
```
Edge middleware is more structured than raw log parsing and integrates with modern deployment workflows. It works across distributed CDN nodes. However, it requires custom code, ongoing maintenance, and you still need to build your own dashboards and analysis pipeline.
Method 3: Dedicated Monitoring (BotSight)
BotSight is a purpose-built AI bot monitoring platform that identifies all 15+ AI crawlers automatically. It provides crawl analytics dashboards, AI visibility scores, freshness alerts when crawl patterns change, and crawl-to-citation correlation.
The comparison:
| Feature | Server Logs | Edge Middleware | BotSight |
|---|---|---|---|
| Setup complexity | Low (already exists) | Medium (custom code) | Low (add integration) |
| Captures all AI bots | Yes | Yes | Yes |
| Real-time alerting | No | Possible (custom) | Yes |
| Historical trends | Manual parsing | Custom database | Built-in dashboards |
| Recrawl pattern detection | Manual | Custom | Automatic |
| Citation correlation | No | No | Yes |
| Multi-site support | Difficult | Per-deployment | Built-in |
| Cost | Free | Platform costs | Subscription |
| Maintenance burden | High | Medium | Low |
The Bottom Line: Server logs are the ground truth and every other method builds on the same data. Edge middleware is the best DIY option for teams already on Vercel or Cloudflare. BotSight eliminates the build-versus-buy dilemma for teams that need actionable insights without custom infrastructure.
📊 BOT TRAFFIC PATTERNS AND WHAT THEY SIGNAL
Raw bot traffic logs are just numbers until you know what to look for. Every bot visit falls into one of five signal categories.
Signal 1: Discovery
A bot visits a page for the first time. This means it found your content through your sitemap, an external link, or internal navigation. If a page has never been discovered by a specific bot, that bot cannot cite it.
Signal 2: Indexing
The bot returns to the same page on a different day. Multi-day visits indicate the bot is actively indexing your content, not just discovering it.
Signal 3: Recrawl
The bot returns to an already-indexed page. This is the strongest positive signal. Recrawls mean the bot considers your content worth checking for updates. Lee (2026a) found that AI platforms with pre-built indices (like Perplexity) show strong freshness bias, recrawling updated content 3.3x faster than Google does for medium-velocity topics.
Signal 4: Burst Activity
A sudden spike in bot visits to a specific page. Bursts typically mean the page is being actively cited in user conversations (for live-fetching bots like ChatGPT and Claude) or has been flagged for re-indexing (for pre-built index bots like Perplexity).
Signal 5: Silence
A page that was previously crawled stops receiving bot visits. This is a negative signal. It may mean the bot has deprioritized your content, your robots.txt is blocking it, or competing content has taken priority.
The Bottom Line: Discovery and indexing are prerequisites. Recrawl frequency is the leading indicator of citation potential. Burst activity confirms active citation. Silence means you need to investigate.
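Most of these signals fall out of per-page visit dates. A rough sketch; the thresholds are illustrative assumptions, not values from the article's data, and burst detection (which compares same-day hit counts against a rolling baseline) is omitted:

```python
from datetime import date

def classify_signal(visit_dates: list[date], today: date,
                    silence_days: int = 30) -> str:
    """Rough signal classification for one (bot, page) pair.
    Thresholds here are illustrative, not empirically derived."""
    if not visit_dates:
        return "never discovered"
    days_quiet = (today - max(visit_dates)).days
    if days_quiet > silence_days:
        return "silence"
    distinct_days = len(set(visit_dates))
    if distinct_days >= 3:
        return "recrawl"
    if distinct_days == 2:
        return "indexing"
    return "discovery"

visits = [date(2026, 3, 1), date(2026, 3, 8), date(2026, 3, 20)]
print(classify_signal(visits, today=date(2026, 3, 24)))  # → recrawl
```

Running this per (bot, page) pair over 30 days of logs turns raw hit counts into the five categories above.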
Crawl Behavior by Platform
| Platform | Architecture | Volume | Freshness | Discovery Method |
|---|---|---|---|---|
| OpenAI (GPTBot) | Broad autonomous crawl | High | Days to weeks | Sitemap + link following |
| OpenAI (OAI-SearchBot) | Query-triggered fetch | Moderate | Real-time | Bing index + query relevance |
| OpenAI (ChatGPT-User) | User-triggered fetch | Low | Real-time | User URLs + live search |
| Anthropic (ClaudeBot) | Session-cached crawl | Moderate | Session-level | Link following + direct fetches |
| Perplexity (PerplexityBot) | Index-building crawl | High | Hours to days | Aggressive sitemaps + links |
| Google (Google-Extended) | Piggybacks on Googlebot | N/A | Matches Googlebot | Google's crawl infrastructure |
| ByteDance (Bytespider) | Broad autonomous crawl | Very high | Days | Aggressive link following |
| Meta (Meta-ExternalAgent) | Broad autonomous crawl | Moderate | Days to weeks | Link following |
| Amazon (AmazonBot) | Selective crawl | Low-moderate | Weeks | Sitemap + selective links |
| Apple (Applebot) | Selective crawl | Low-moderate | Days to weeks | Sitemap + link following |
For live-fetch platforms (ChatGPT, Claude), keeping your server fast and available matters most. For index-based platforms (Perplexity, Gemini), crawl frequency and sitemap accuracy drive freshness.
Content Types That Attract More Recrawls
FAQ-style content and structured Q&A pages receive substantially more recrawl activity than other content types. Lee (2026c) found that FAQPage schema carries an odds ratio of 1.39 for AI citation, meaning pages with FAQ schema are 39% more likely to be cited.
| Content Type | Relative Recrawl Rate | Why |
|---|---|---|
| FAQ / Q&A pages | 2x baseline | Discrete answers match AI retrieval patterns |
| How-to guides | 1.5x baseline | Step-by-step structure is highly extractable |
| Comparison pages | 1.3x baseline | Table-heavy format maps to comparison queries |
| Generic blog posts | 1x baseline | Narrative content is harder for AI to extract |
| Landing pages (thin) | 0.5x baseline | Low content density signals low value |
Key Metrics to Track
For each bot, track four things: total requests over 30 days, unique pages visited (page coverage), pages visited more than once (recrawl candidates), and most-visited pages (priority content). Then compare across bots. If ChatGPT's bot visits a page 88 times but PerplexityBot visits it zero times, your content is being cited in ChatGPT conversations but has not been indexed by Perplexity.
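Those four metrics take only a few lines of Python. The record format is an assumption for illustration: one `(bot, path)` tuple per logged request:

```python
from collections import Counter, defaultdict

def crawl_metrics(records):
    """records: iterable of (bot, path) tuples from the last 30 days of logs.
    Returns per-bot totals, page coverage, recrawl candidates, and top pages."""
    per_bot = defaultdict(Counter)
    for bot, path in records:
        per_bot[bot][path] += 1
    report = {}
    for bot, pages in per_bot.items():
        report[bot] = {
            "total_requests": sum(pages.values()),
            "unique_pages": len(pages),
            "recrawl_candidates": sum(1 for n in pages.values() if n > 1),
            "top_pages": pages.most_common(3),
        }
    return report

records = [("GPTBot", "/blog/a"), ("GPTBot", "/blog/a"), ("GPTBot", "/blog/b"),
           ("PerplexityBot", "/blog/b")]
print(crawl_metrics(records)["GPTBot"])
# → {'total_requests': 3, 'unique_pages': 2, 'recrawl_candidates': 1,
#    'top_pages': [('/blog/a', 2), ('/blog/b', 1)]}
```

Comparing the per-bot reports side by side surfaces exactly the ChatGPT-versus-Perplexity gaps described above.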
Monitor HTTP status codes returned to crawlers. A spike in 403 (forbidden) or 429 (rate limited) responses means you are actively blocking crawlers, potentially without realizing it. CDN-level bot protection (Cloudflare Bot Fight Mode, Sucuri) can block AI crawlers even when your robots.txt allows them.
The Bottom Line: Track patterns, not just numbers. A page getting crawled 100 times a day but never cited has a content problem. A page getting cited frequently but crawled rarely is at risk of going stale. For the factors that actually predict whether a page gets cited, see What Gets You Cited by AI, Explained.
🔧 COMMON MISTAKES AND HOW TO FIX THEM
Typos in user agent strings. Robots.txt token matching is case-sensitive in some crawler implementations and case-insensitive in others. Use the exact capitalization shown in this guide. GPTBot is correct. gptbot and GptBot may not match.
Blocking the wrong OpenAI bot. Blocking GPTBot when you meant to block all OpenAI crawling still leaves OAI-SearchBot and ChatGPT-User active. Conversely, blocking ChatGPT-User when you only wanted to block training still leaves GPTBot running.
Confusing FacebookBot with Meta-ExternalAgent. FacebookBot renders link previews. Meta-ExternalAgent collects AI training data. Blocking FacebookBot breaks your social sharing cards. They serve completely different functions.
Forgetting bingbot is the ChatGPT discovery layer. You can allow all three OpenAI bots and still get zero ChatGPT citations if bingbot cannot index your pages.
Not blocking Bytespider when you should. Bytespider is the highest-volume AI crawler on most sites. If server load from AI crawlers is a concern, Bytespider is usually the first one to rate-limit or block.
Assuming Google-Extended controls Google Search. Blocking Google-Extended has zero effect on your traditional Google search rankings. It only controls AI overlay features (Gemini, AI Mode).
Blocking bots without monitoring. Configuring robots.txt without verifying crawl behavior in your logs is guesswork. Always verify changes by checking server logs or your monitoring tool for the expected changes in bot activity.
🚀 GETTING STARTED: A 30-MINUTE SETUP
You do not need a full monitoring solution to start.
Step 1: Check your robots.txt at https://yoursite.com/robots.txt to confirm you are not accidentally blocking GPTBot, ChatGPT-User, ClaudeBot, PerplexityBot, or Google-Extended.
Step 2: Run a free AI Visibility Quick Check to see which bots can currently access your site and how your configuration compares.
Step 3: Create a daily log summary script:
```bash
#!/bin/bash
BOTS="GPTBot|ClaudeBot|PerplexityBot|Google-Extended|ChatGPT-User|OAI-SearchBot"
echo "--- Visits by Bot ---"
grep -E "$BOTS" /var/log/nginx/access.log | grep -oP 'compatible; \K[^/;)]+' | sort | uniq -c | sort -rn
echo "--- Top 20 Pages Crawled ---"
grep -E "$BOTS" /var/log/nginx/access.log | awk '{print $7}' | sort | uniq -c | sort -rn | head -20
```
Step 4: Apply the robots.txt template that matches your strategy (Template 1, 2, or 3 above). Include Sitemap directives pointing to your XML sitemap and any structured data files.
Step 5: Graduate to sustained monitoring. Trend graphs, recrawl detection, and citation correlation require either significant custom work or a dedicated tool. Start with the basics, then decide how much automation you need.
For a comprehensive AI optimization strategy built on top of crawl access, see our AI SEO services.
❓ FREQUENTLY ASKED QUESTIONS
Can Google Analytics 4 track AI bots?
No. GA4 relies entirely on client-side JavaScript execution. AI crawlers do not execute JavaScript. They make HTTP requests, download the HTML, and leave. This means GA4 will never register a single AI bot visit. The same applies to Adobe Analytics, Mixpanel, Plausible, and Fathom. Server-side tracking is the only option.
How many AI bot user agents are active in 2026?
At least 16 distinct AI-related user agent tokens are actively crawling the web as of early 2026, operated by 10 different companies. This number continues to grow as new AI search products launch and existing operators add specialized crawler variants.
What is the difference between ChatGPT bot traffic and Perplexity bot traffic?
ChatGPT's bot fetches pages live during user conversations. It only visits your page when a user asks a question that triggers a search result matching your content. PerplexityBot crawls proactively, building a pre-built index independent of user queries. ChatGPT bot traffic tells you "users are asking about this topic right now." Perplexity bot traffic tells you "this content is being indexed for future use."
Can AI bots spoof their user agent strings?
Legitimate AI bots from major operators use consistent, verifiable user agent strings and publish their IP address ranges. However, malicious actors can spoof any user agent string. Always cross-reference the source IP address against the operator's published IP ranges. If a request claims to be GPTBot but comes from an unrecognized IP, it is not a real OpenAI crawler.
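That cross-check can be sketched with Python's stdlib ipaddress module. The CIDR blocks below are RFC 5737 documentation ranges used purely as placeholders; in practice, fetch the operator's current published list:

```python
import ipaddress

# Placeholder ranges only (RFC 5737 test networks), NOT OpenAI's real list.
CLAIMED_GPTBOT_RANGES = ["192.0.2.0/24", "198.51.100.0/24"]

def is_verified_ip(ip: str, published_ranges: list[str]) -> bool:
    """True if the source IP falls inside one of the published CIDR blocks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in published_ranges)

# A request claiming to be GPTBot from outside the published ranges is spoofed.
print(is_verified_ip("192.0.2.42", CLAIMED_GPTBOT_RANGES))   # → True
print(is_verified_ip("203.0.113.9", CLAIMED_GPTBOT_RANGES))  # → False
```

Run this check before trusting any user-agent-based allow rule in your own middleware.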
Should I block AI bots in robots.txt?
For most businesses, no. Blocking AI crawlers removes you from AI search citations and prevents discovery of your structured data. The only scenario where blocking makes sense is if you have strong licensing or copyright concerns about AI training specifically. Even then, ChatGPT-User will still fetch your content during live conversations regardless. A selective approach (Template 2 above, allowing search bots while blocking training bots) is the most practical middle ground.
Does blocking Google-Extended affect my Google Search rankings?
No. Google-Extended controls only whether your content is used for Gemini AI training and AI Mode features. It has no effect on Googlebot or your Google Search rankings. Many publishers blocked Google-Extended in 2024 thinking it would reduce Google's use of their content. It did not affect traditional search at all.
How quickly do AI bots pick up robots.txt changes?
Most AI crawlers re-fetch robots.txt every 24 to 72 hours. GPTBot and ClaudeBot typically check daily. PerplexityBot can take up to a week. Claude-User caches robots.txt at the session level, so existing sessions may not reflect changes until the next session starts. There is no way to force an immediate refresh.
What happens if my robots.txt has conflicting rules?
Per RFC 9309, the most specific rule wins. If you have Allow: /blog/ and Disallow: / for the same user agent, the bot should access /blog/ paths but not other paths. However, implementation varies by bot. Verify with server logs rather than assuming compliance.
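The longest-match precedence can be sketched in a few lines. This is an illustration of the rule, not a full RFC 9309 parser; wildcard (*) and end-anchor ($) handling are omitted:

```python
def allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Evaluate one user agent's Allow/Disallow rules per RFC 9309:
    the longest matching path prefix wins; at equal length, Allow beats
    Disallow. Wildcard and end-anchor handling are omitted in this sketch."""
    matches = [(len(p), d.lower() == "allow")
               for d, p in rules if path.startswith(p)]
    if not matches:
        return True  # no matching rule means the path is allowed
    matches.sort()   # longest prefix last; at equal length Allow (True) sorts last
    return matches[-1][1]

rules = [("Allow", "/blog/"), ("Disallow", "/")]
print(allowed("/blog/post", rules))  # → True: /blog/ is the longer match
print(allowed("/private", rules))    # → False: / is the only match
```

This is why the Allow/Disallow pair in the question above resolves the way it does, and why log verification still matters: not every bot implements the precedence faithfully.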
📚 REFERENCES
- Lee, A. (2026a). "Query Intent, Not Google Rank: What Best Predicts AI Citation Behavior." Preprint v5, A.I. Plus Automation. DOI: 10.5281/zenodo.18653093
- Lee, A. (2026c). "I Rank on Page 1: What Gets Me Cited by AI? Position-Controlled Analysis of Page-Level and Domain-Level Predictors of AI Search Citation." A.I. Plus Automation. Paper | Dataset DOI: 10.5281/zenodo.19398158
- Aggarwal, P., Murahari, V., Rajpurohit, T., Kalyan, A., Narasimhan, K., & Deshpande, A. (2024). "GEO: Generative Engine Optimization." KDD 2024. DOI: 10.48550/arXiv.2311.09735
- Longpre, S., Mahari, R., Lee, A., Lund, C.P., & Oderinwale, H. (2024). "Consent in Crisis: The Rapid Decline of the AI Data Commons." Preprint. DOI: 10.48550/arxiv.2407.14933
- Cui, J., Zha, M., Wang, X., & Liao, X. (2025). "The Odyssey of robots.txt Governance: Measuring Convention Implications of Web Bots in Large Language Model Services." ACM WWW 2025. DOI: 10.1145/3719027.3765063
- Kadel, J., See, A., Sinha, R., & Fischer, M. (2024). "BOTracle: A Framework for Discriminating Bots and Humans." arXiv. DOI: 10.48550/arXiv.2412.02266
- RFC 9309 (2022). "Robots Exclusion Protocol." IETF.