
robots.txt for AI Bots: The Complete Reference Guide (2026)

2026-03-24

Your robots.txt file is the first thing every AI crawler reads. It is also the last line of defense most publishers never configure correctly. This guide gives you the complete picture: which bots respect it, which ignore it, and exactly what to put in yours.

Most publishers either block all AI bots out of fear or leave their robots.txt untouched and hope for the best. Both approaches cost visibility. The research is clear: robots.txt is not just an access control file. It is a discovery mechanism that AI crawlers actively parse to find content they would never encounter through normal crawling (Lee, 2026).

We tested every major AI crawler against robots.txt directives using server-side Vercel middleware logs on aiplusautomation.com. Every request was logged with its user agent, the path it hit, and the timestamp. This guide distills those findings alongside published research into a single, actionable reference.

🤖 WHAT ROBOTS.TXT ACTUALLY DOES (AND DOES NOT DO)

robots.txt is a plain text file at the root of your domain (/robots.txt) that tells web crawlers which paths they can and cannot access. It has been a web standard since 1994 and was formalized as RFC 9309 in 2022.

What it does:

  • Declares which user agents can crawl which paths
  • Provides Sitemap references for content discovery
  • Sets crawl-delay preferences (respected by some bots)

What it does not do:

  • Block content from being read (it is advisory, not enforced)
  • Prevent content from appearing in AI training data retroactively
  • Stop live conversational fetches from ChatGPT-User
  • Guarantee any crawler will comply
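The "advisory, not enforced" point can be seen in code: compliance lives entirely on the client side. A minimal sketch using Python's standard-library urllib.robotparser shows how a compliant crawler consults the rules before fetching; a non-compliant bot simply skips this step, and nothing on the server stops it.

```python
# Minimal sketch: how a *compliant* crawler consults robots.txt before fetching.
# Nothing server-side enforces the answer -- non-compliant bots skip the check.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: GPTBot
Disallow: /premium/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A compliant bot asks before each fetch; the verdict is purely advisory.
print(parser.can_fetch("GPTBot", "https://example.com/premium/report"))  # False
print(parser.can_fetch("GPTBot", "https://example.com/blog/post"))       # True
```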

The Bottom Line: robots.txt is a polite request, not a firewall. Compliant bots honor it. Non-compliant bots ignore it. And at least one bot, ChatGPT-User, has been explicitly redesignated by its operator as a "user extension" rather than a crawler, exempting it from robots.txt conventions entirely.

Longpre et al. (2024) documented the scale: in a single year (2023 to 2024), over 5% of all tokens in the C4 training corpus became restricted through robots.txt, and 28% of the most actively maintained sources added AI-specific blocks.

📊 WHICH AI BOTS RESPECT ROBOTS.TXT (COMPLETE TABLE)

This is the core reference. We verified each crawler's behavior through controlled server log analysis over several months.

| Bot (User-Agent Token) | Operator | Purpose | Respects robots.txt | Notes |
|---|---|---|---|---|
| GPTBot | OpenAI | Training data collection | Yes | Also follows Sitemap directives to discover content |
| OAI-SearchBot | OpenAI | ChatGPT Search index | Yes | Separate token from GPTBot |
| ChatGPT-User | OpenAI | Live conversational fetch | No | Reclassified as "user extension" Dec 2025 |
| ClaudeBot | Anthropic | Training data collection | Yes | Standard crawler compliance |
| Claude-User | Anthropic | Live conversational fetch | Yes (session-cached) | Checks robots.txt once per session, caches result |
| PerplexityBot | Perplexity | Background index crawler | Yes | No live fetcher; all answers from pre-built index |
| Google-Extended | Google | AI training data (Gemini) | Yes | Separate from Googlebot; controls AI training only |
| Googlebot | Google | Google Search (including AI Mode) | Yes | Standard Googlebot rules apply |
| Bytespider | ByteDance | Training data collection | Yes | Often blocked due to aggressive crawl rates |
| Applebot-Extended | Apple | Apple Intelligence training | Yes | Separate from Applebot (Siri/Spotlight) |
| meta-externalagent | Meta | AI training data | Yes | Relatively new token (2024) |
| Amazonbot | Amazon | Alexa/AI training | Yes | Respects crawl-delay |
| cohere-ai | Cohere | Training data collection | Yes | Lower volume crawler |
| Diffbot | Diffbot | Web data extraction | Yes | Used by many downstream AI services |
| Timesbot | Brave | Brave Search AI features | Yes | Lower volume |
| FacebookBot | Meta | Content preview/AI | Yes | Distinct from meta-externalagent |
| anthropic-ai | Anthropic | Research crawler | Yes | Less common than ClaudeBot |

Cui et al. (2025) conducted a systematic analysis of 18 LLM bots across 582,281 robots.txt files and confirmed a significant increase in AI-specific robots.txt rules, particularly in finance and news categories. Their research also identified multiple cases of robots.txt violations, including instances where ChatGPT-User accessed restricted content.

The Bottom Line: Of 15+ tracked AI bots, only one -- ChatGPT-User -- ignores robots.txt outright. Every other major crawler complies. Claude is the most thorough: Claude-User checks robots.txt before its first fetch in each session and returns a "no permission" error if a path is disallowed. For the full breakdown of OpenAI's three-bot split, see our deep dive: OpenAI's Bots Have a Split Personality.

🔍 ROBOTS.TXT AS A DISCOVERY MECHANISM

Here is what most guides miss: robots.txt is not just about blocking. It is one of the most reliable ways to help AI crawlers find your content.

How GPTBot Uses Sitemap Directives

Our robots.txt includes this line:

Sitemap: https://aiplusautomation.com/site-knowledge.jsonld

That is a JSON-LD knowledge graph file, not a standard sitemap.xml, and no internal links anywhere on the site point to it. The only way to find it is to read robots.txt, parse the Sitemap directive, and follow the URL.

GPTBot did exactly that. Our server logs show GPTBot fetching site-knowledge.jsonld, a file discoverable only through robots.txt. This confirms that GPTBot treats robots.txt as a discovery tool, not just a permission file.

The Bottom Line: If you have structured data, knowledge graphs, or specialized content you want AI training crawlers to index, reference them in your robots.txt Sitemap directives. GPTBot actively follows these references. For more on tracking which bots actually crawl your site, see How to See Which AI Bots Crawl Your Website.

Sitemap Best Practices for AI Crawlers

  1. Include your XML sitemap -- standard, but essential
  2. Reference structured data files -- JSON-LD knowledge graphs, site-knowledge files
  3. Keep sitemaps fresh -- AI crawlers check for new content here first
  4. Use multiple Sitemap directives if you have separate data for different purposes

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/site-knowledge.jsonld
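The discovery step a Sitemap-following crawler performs can be sketched in a few lines with Python's standard-library urllib.robotparser (its site_maps() helper requires Python 3.8+). The URLs below are placeholders; every Sitemap URL, XML or not, is a candidate fetch target.

```python
# Sketch: extracting Sitemap directives from robots.txt the way a
# discovery-oriented crawler would. site_maps() needs Python 3.8+.
from urllib.robotparser import RobotFileParser

robots = """\
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/site-knowledge.jsonld
"""

parser = RobotFileParser()
parser.parse(robots.splitlines())

# Each Sitemap URL becomes a fetch target -- including non-XML files.
for url in parser.site_maps() or []:
    print(url)
```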

🎭 THE "BLOCKING AI IS THEATER" REALITY

Let's address the elephant in the room. Many publishers have added blanket AI bot blocks to their robots.txt:

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

This feels decisive. It is mostly theater. Here is why:

1. ChatGPT-User ignores it. The bot most likely to present your content directly to users (during live ChatGPT conversations) does not check robots.txt at all. OpenAI formalized this in December 2025 by reclassifying ChatGPT-User as a "user extension" rather than a crawler.

2. Training data is retrospective. If your content was crawled before you added the block, it may already exist in training data. robots.txt is not a deletion request. It only governs future crawling.

3. Third-party caching and redistribution. Your content likely appears on aggregators, web archives, and content platforms that have their own crawl policies. Blocking the primary AI crawlers does not erase copies that exist elsewhere.

4. You lose the discovery channel. Blocking GPTBot means it never follows your Sitemap directives, never discovers your structured data, and never indexes your new content. You are invisible to the training pipeline that determines what ChatGPT knows about your brand.

5. You lose citation eligibility. Perplexity serves answers exclusively from its pre-built index (crawled by PerplexityBot). Block PerplexityBot, and you will never appear in Perplexity citations. The same logic applies to OAI-SearchBot and ChatGPT Search results.

Longpre et al. (2024) found that if current robots.txt restrictions were fully enforced, they would rapidly bias the diversity, freshness, and scaling laws for general-purpose AI systems. The effect would not be "your content is protected." The effect would be "your content does not exist in AI."

For most businesses, the better strategy is allowing AI crawlers and optimizing for visibility. Our AI Visibility service helps with exactly this.

📝 RECOMMENDED ROBOTS.TXT CONFIGURATION

Based on our research and testing, here is our recommended approach for most businesses that want to maximize AI visibility.

Template 1: Maximum AI Visibility (Recommended)

# === Standard Crawlers ===
User-agent: *
Allow: /

# === AI Training Crawlers ===
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Bytespider
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: meta-externalagent
Allow: /

User-agent: Amazonbot
Allow: /

User-agent: cohere-ai
Allow: /

User-agent: anthropic-ai
Allow: /

User-agent: Diffbot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: Timesbot
Allow: /

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/site-knowledge.jsonld

Why explicit Allow directives? While User-agent: * technically permits all bots, explicitly naming each AI crawler signals intent and makes your policy unambiguous. It also prepares per-bot sections for path-specific restrictions later.

Template 2: Allow AI Search, Block AI Training

For publishers who want to appear in AI search results but prefer not to contribute to training data:

# === Standard Crawlers ===
User-agent: *
Allow: /

# === AI Search Crawlers (Allowed) ===
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# === AI Training Crawlers (Blocked) ===
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: anthropic-ai
Disallow: /

Sitemap: https://example.com/sitemap.xml

Trade-off: You retain search citation eligibility but lose GPTBot's Sitemap discovery benefits. A defensible middle ground for publishers with licensing concerns.

Template 3: Selective Path Blocking

For sites that want AI visibility for most content but need to protect specific sections (e.g., premium content, gated resources):

# === Standard Crawlers ===
User-agent: *
Allow: /

# === AI Crawlers: Allow site, block premium ===
User-agent: GPTBot
Allow: /
Disallow: /premium/
Disallow: /members-only/

User-agent: ClaudeBot
Allow: /
Disallow: /premium/
Disallow: /members-only/

User-agent: PerplexityBot
Allow: /
Disallow: /premium/
Disallow: /members-only/

User-agent: OAI-SearchBot
Allow: /
Disallow: /premium/
Disallow: /members-only/

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/site-knowledge.jsonld

Note: ChatGPT-User will still fetch premium paths during live conversations regardless. Server-level IP blocking is the only reliable method for that bot.
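Server-level blocking for a bot like this typically combines two signals: the user-agent string (trivially spoofable) and the operator's published IP ranges. A hedged Python sketch of that gate follows; the CIDR below is a documentation placeholder (TEST-NET-3), not a real OpenAI range, so always pull current ranges from the operator's own documentation, and remember that blocking an operator's full ranges also blocks its compliant crawlers.

```python
# Sketch of a server-level gate for bots that ignore robots.txt. The UA string
# can be spoofed, so pair it with an IP check against the operator's published
# ranges. The network below is a PLACEHOLDER (TEST-NET-3), not a real range.
import ipaddress

BLOCKED_UA_TOKENS = ("ChatGPT-User",)
BLOCKED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]  # placeholder

def should_block(user_agent: str, remote_ip: str) -> bool:
    """Return True when a request matches a blocked bot signature."""
    ua_match = any(t.lower() in user_agent.lower() for t in BLOCKED_UA_TOKENS)
    ip = ipaddress.ip_address(remote_ip)
    ip_match = any(ip in net for net in BLOCKED_NETWORKS)
    return ua_match or ip_match

print(should_block("Mozilla/5.0 (compatible; ChatGPT-User/1.0)", "198.51.100.7"))  # True
print(should_block("Mozilla/5.0 (ordinary browser)", "203.0.113.9"))               # True
print(should_block("Mozilla/5.0 (ordinary browser)", "198.51.100.7"))              # False
```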

📋 COMPLETE USER-AGENT TOKEN REFERENCE

Every AI-related user-agent token you should know about, organized by operator:

OpenAI

| Token | Purpose | First Seen |
|---|---|---|
| GPTBot | Training data crawler | 2023 |
| OAI-SearchBot | ChatGPT Search index | 2024 |
| ChatGPT-User | Live conversation fetch | 2023 |

Anthropic

| Token | Purpose | First Seen |
|---|---|---|
| ClaudeBot | Training data crawler | 2024 |
| Claude-User | Live conversation fetch | 2024 |
| anthropic-ai | Research crawler | 2023 |

Google

| Token | Purpose | First Seen |
|---|---|---|
| Google-Extended | AI training (Gemini) | 2023 |
| Googlebot | Google Search (incl. AI Mode) | Legacy |

Other Platforms

| Token | Operator | Purpose | First Seen |
|---|---|---|---|
| PerplexityBot | Perplexity | Index crawler | 2024 |
| Bytespider | ByteDance | Training crawler | 2023 |
| Applebot-Extended | Apple | AI training | 2024 |
| meta-externalagent | Meta | AI training | 2024 |
| FacebookBot | Meta | Content preview/AI | Legacy |
| Amazonbot | Amazon | Alexa/AI training | 2023 |
| cohere-ai | Cohere | Training crawler | 2024 |
| Diffbot | Diffbot | Data extraction | Legacy |
| Timesbot | Brave | Search AI features | 2024 |

For a live view of which bots are hitting your site right now, run a free AI visibility check.

⚡ CHATGPT-USER: THE BOT YOU CANNOT BLOCK WITH ROBOTS.TXT

This deserves its own section because it is the most consequential change in AI crawling behavior.

In December 2025, OpenAI updated its crawler documentation to formally remove any language suggesting ChatGPT-User would comply with robots.txt. Their stated rationale: ChatGPT-User is "a technical extension of the user" rather than an autonomous crawler. When someone asks ChatGPT to browse the web and it fetches your page, OpenAI considers that equivalent to a human clicking a link.

Our testing confirms this. ChatGPT-User fetched pages that were explicitly blocked by Disallow directives. No robots.txt request preceded the page fetch in server logs. It does not check.

How Claude handles the same scenario differently: Claude-User checks robots.txt once at the start of each session. If you have a Disallow for Claude-User, it returns a "no permission" error and logs zero server hits for the blocked path. The difference in approach is stark.

| Behavior | ChatGPT-User | Claude-User |
|---|---|---|
| Checks robots.txt | No | Yes (once per session) |
| Honors Disallow directives | No | Yes |
| Caching behavior | N/A | Session-level cache |
| Blocking method | Server-level IP block only | robots.txt works |
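The session-scoped pattern attributed to Claude-User is easy to picture in code. This is an illustrative sketch using Python's urllib.robotparser, not Anthropic's implementation: the rules are parsed once on the first fetch of a session and every later fetch reuses the cached verdict source.

```python
# Illustrative sketch of session-level robots.txt caching (the pattern the
# table attributes to Claude-User). Not Anthropic's actual implementation.
from urllib.robotparser import RobotFileParser

class SessionFetcher:
    """Hypothetical client: parse robots.txt once, reuse it for the session."""
    def __init__(self, user_agent: str, robots_lines: list[str]):
        self.user_agent = user_agent
        self._robots_lines = robots_lines
        self._rules = None  # populated on first fetch of the session

    def allowed(self, url: str) -> bool:
        if self._rules is None:  # first fetch: consult robots.txt, then cache
            self._rules = RobotFileParser()
            self._rules.parse(self._robots_lines)
        return self._rules.can_fetch(self.user_agent, url)

robots = ["User-agent: Claude-User", "Disallow: /private/"]
session = SessionFetcher("Claude-User", robots)
print(session.allowed("https://example.com/private/doc"))  # False
print(session.allowed("https://example.com/blog/post"))    # True
```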

The Bottom Line: If you need to control ChatGPT-User access, robots.txt will not help. Your only option is blocking OpenAI's IP ranges at the server level, which also blocks GPTBot and OAI-SearchBot. This is almost always counterproductive if you want AI visibility.

For the full technical breakdown, read OpenAI's Bots Have a Split Personality.

🛡️ PRACTICAL IMPLEMENTATION CHECKLIST

Use this checklist to audit and update your robots.txt for AI visibility:

  • Verify your robots.txt is accessible at yourdomain.com/robots.txt
  • Check for accidental blanket blocks (e.g., Disallow: / under User-agent: *)
  • Add explicit AI crawler rules using the templates above
  • Include Sitemap directives pointing to your XML sitemap and any structured data files
  • Remove outdated bot blocks (e.g., blocks for bots that no longer exist or have changed tokens)
  • Test with a fetch tool to confirm your robots.txt parses correctly
  • Monitor AI crawler activity to verify bots are actually hitting your site after changes
  • Set a review schedule because new AI bots appear regularly and token names change
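The monitoring step in the checklist can start as something very small: scan your access logs for known AI user-agent tokens and tally hits. A minimal Python sketch under stated assumptions follows; the log lines are invented examples in combined-log style, not real traffic, and the token list is a subset of the table above.

```python
# Sketch: tallying AI crawler hits from an access log so you can verify that
# robots.txt changes actually shifted bot behavior. Log lines are illustrative.
from collections import Counter

AI_BOT_TOKENS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
                 "Claude-User", "PerplexityBot", "Google-Extended", "Bytespider"]

log_lines = [
    '1.2.3.4 - - [24/Mar/2026] "GET /blog/post HTTP/1.1" 200 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"',
    '5.6.7.8 - - [24/Mar/2026] "GET /robots.txt HTTP/1.1" 200 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '9.9.9.9 - - [24/Mar/2026] "GET / HTTP/1.1" 200 "-" "Mozilla/5.0 (ordinary browser)"',
]

hits = Counter()
for line in log_lines:
    for token in AI_BOT_TOKENS:
        if token.lower() in line.lower():
            hits[token] += 1

print(dict(hits))  # {'GPTBot': 1, 'ClaudeBot': 1}
```

In production you would stream real log files (or middleware events) through the same loop and chart the counts per day.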

For ongoing monitoring, see our guide on how to track AI bots effectively.

🔗 HOW ROBOTS.TXT FITS INTO YOUR AI VISIBILITY STRATEGY

robots.txt is one piece of a larger optimization strategy. Here is how it connects to the rest:

1. robots.txt + Structured Data = AI Discovery. AI crawlers that follow your Sitemap directives will find your structured data (JSON-LD, schema markup). Research shows that schema markup presence is associated with 69% higher AI citation odds (Lee, 2026). robots.txt is the front door to that data.

2. robots.txt + Content Quality = Citation Eligibility. Allowing crawlers is necessary but not sufficient. The content itself needs to match query intent and include the page-level features that predict citation. Aggarwal et al. (2024) demonstrated that targeted optimization strategies can boost visibility in generative engine responses by up to 40%. For a complete optimization framework, see our Generative Engine Optimization guide.

3. robots.txt + Monitoring = Informed Decisions. Configuring robots.txt without monitoring is flying blind. You need to verify that bots are actually crawling, how often they return, and which paths they prioritize. Our AI Visibility service provides this monitoring layer.

❓ FREQUENTLY ASKED QUESTIONS

Should I block AI bots in robots.txt? For most businesses, no. Blocking AI crawlers removes you from AI search citations (Perplexity, ChatGPT Search) and prevents discovery of your structured data. The only scenario where blocking makes sense is if you have strong licensing or copyright concerns about AI training specifically, and even then, ChatGPT-User will still fetch your content during live conversations regardless. A selective approach (Template 2 above) is a more practical middle ground.

How quickly do AI bots pick up robots.txt changes? Most AI crawlers re-fetch robots.txt every 24 to 72 hours. GPTBot and ClaudeBot typically check daily. PerplexityBot can take up to a week. Claude-User caches robots.txt at the session level, so existing sessions may not reflect changes until the next session starts. There is no way to force an immediate refresh.

Does blocking Google-Extended affect my Google Search rankings? No. Google-Extended controls only whether your content is used for Gemini AI training. It has no effect on Googlebot or your Google Search rankings (including AI Overviews and AI Mode). Google-Extended is a separate token specifically for AI training opt-out.

Can I allow AI search bots but block AI training bots? Yes. Template 2 above does exactly this. Allow OAI-SearchBot (ChatGPT Search) and PerplexityBot (Perplexity answers) while blocking GPTBot (OpenAI training), ClaudeBot (Anthropic training), and Google-Extended (Gemini training). The trade-off is losing GPTBot's Sitemap discovery behavior.

What happens if my robots.txt has conflicting rules for an AI bot? Per RFC 9309, the most specific rule wins. If you have Allow: /blog/ and Disallow: / for the same user agent, the bot should access /blog/ paths but not other paths. However, implementation varies by bot. Test with server logs rather than assuming compliance. For tips on setting up effective monitoring, see how to track AI bots effectively.
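RFC 9309's precedence rule (longest matching path wins, with Allow winning ties) can be sketched directly, using the same Allow: /blog/ plus Disallow: / example from the answer above. This models the spec's intent, not any particular bot's behavior, which is why log-based verification still matters.

```python
# Sketch of RFC 9309 rule precedence: the longest matching path prefix wins,
# and Allow wins a length tie. Real crawler implementations vary.
def resolve(path: str, rules: list[tuple[str, str]]) -> bool:
    """rules: list of ("allow"|"disallow", path_prefix). True means allowed."""
    best = None  # (prefix_length, verdict)
    for verdict, prefix in rules:
        if path.startswith(prefix):
            length = len(prefix)
            if (best is None or length > best[0]
                    or (length == best[0] and verdict == "allow")):
                best = (length, verdict)
    return best is None or best[1] == "allow"  # no matching rule: allowed

rules = [("disallow", "/"), ("allow", "/blog/")]
print(resolve("/blog/ai-guide", rules))   # True: "/blog/" (6 chars) beats "/"
print(resolve("/premium/report", rules))  # False: only "/" matches
```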

📚 REFERENCES

  • Lee, A. (2026). "Query Intent, Not Google Rank: What Best Predicts AI Citation Behavior." Preprint v5. DOI
  • Aggarwal, P., Murahari, V., Rajpurohit, T., Kalyan, A., Narasimhan, K., & Deshpande, A. (2024). "GEO: Generative Engine Optimization." KDD 2024. DOI
  • Longpre, S., Mahari, R., Lee, A., Lund, C.P., & Oderinwale, H. (2024). "Consent in Crisis: The Rapid Decline of the AI Data Commons." Preprint. DOI
  • Cui, J., Zha, M., Wang, X., & Liao, X. (2025). "The Odyssey of robots.txt Governance: Measuring Convention Implications of Web Bots in Large Language Model Services." ACM WWW 2025. DOI
  • RFC 9309 (2022). "Robots Exclusion Protocol." IETF.