There are now more than 15 distinct AI bot user agents crawling the web, each with different strings, different purposes, and different robots.txt compliance behaviors. If you do not know every one of them by name, you are almost certainly blocking traffic you want or allowing access you do not.
The AI crawler landscape changed dramatically between 2024 and 2026. What was once a handful of experimental bots has become a full ecosystem of specialized crawlers, each operated by a major AI company, each serving a distinct function. Some crawl for model training data. Some crawl for real-time search answers. Some do both. And the user agent strings they send are the only way to tell them apart.
This reference page catalogs every known AI bot user agent active in 2026. For each bot, you get the exact user agent string, the operator, the stated purpose, whether it respects robots.txt, and practical notes on crawl behavior. Bookmark this page. You will need it.
The Bottom Line: Managing AI crawl access requires knowing the exact user agent string for each bot. A single typo in your robots.txt means the rule does nothing. This reference gives you every string you need, verified against official documentation as of March 2026.
If you want to see which of these bots are already hitting your site, our Track AI Bots Guide walks through server log analysis step by step. For robots.txt configuration, see our Robots.txt for AI Bots Guide.
🤖 WHY AI BOT USER AGENTS MATTER NOW
Traditional web crawlers were simple. Googlebot, Bingbot, a handful of others. You knew them. You configured your robots.txt once, and the job was done for years.
AI bots broke that model. Research from Lee (2026) found that ChatGPT and Claude perform live page fetches for search queries, while Perplexity and Gemini rely on pre-built indices. That architectural difference means blocking one bot has very different consequences than blocking another. Block GPTBot and you lose training inclusion but keep ChatGPT search citations (via ChatGPT-User). Block ChatGPT-User and you lose real-time citations entirely. The distinction matters.
Aggarwal et al. (2024) demonstrated that targeted optimization for generative engines can boost content visibility by up to 40%. But optimization is meaningless if the crawler cannot reach your pages. Crawl access is the foundation layer, and user agent identification is how you control it.
The proliferation of AI bots also creates a measurement challenge. Kadel et al. (2024) found that modern bots are increasingly difficult to distinguish from real users when relying on behavioral signals alone, making user agent string verification even more critical as a first-pass identification method.
The Bottom Line: You cannot optimize for AI platforms you are accidentally blocking. And you cannot block AI platforms you do not know exist. This list solves both problems.
📋 THE COMPLETE AI BOT USER AGENTS TABLE (2026)
This is the core reference. Every AI bot user agent known to be active as of March 2026, organized by operator.
OpenAI Bots
| Field | GPTBot | OAI-SearchBot | ChatGPT-User |
|---|---|---|---|
| User Agent Token | GPTBot | OAI-SearchBot | ChatGPT-User |
| Full UA String | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot) | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot) | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ChatGPT-User/1.0; +https://openai.com/bot) |
| Operator | OpenAI | OpenAI | OpenAI |
| Purpose | Training data collection and general web crawling | Web search for AI-generated search results | Real-time page fetching when users request URLs or ChatGPT performs live search |
| Robots.txt Compliance | Yes | Yes | Yes |
| Crawl Pattern Notes | Crawls broadly. High volume. Respects crawl-delay. Distinct from search/citation bots. Blocking this does NOT block ChatGPT search citations. | Powers the search results in ChatGPT's search feature. Fetches pages for inclusion in search-grounded responses. | Fetches specific URLs during user conversations. Lower volume but targeted. Triggered by user queries in real time. |
OpenAI's three-bot architecture is the most important distinction in AI crawl management. GPTBot is the training crawler. OAI-SearchBot handles search result retrieval. ChatGPT-User handles live browsing during conversations. You can block GPTBot (training) while allowing OAI-SearchBot and ChatGPT-User (citations). This is the recommended configuration for publishers who want AI citation visibility without contributing training data. For a deeper analysis of how these bots interact, see our post on OpenAI's split personality bot architecture.
Anthropic Bots
| Field | ClaudeBot | Claude-User | Claude-SearchBot |
|---|---|---|---|
| User Agent Token | ClaudeBot | Claude-User | Claude-SearchBot |
| Full UA String | Mozilla/5.0 (compatible; ClaudeBot/1.0; +https://www.anthropic.com/crawlers) | Mozilla/5.0 (compatible; Claude-User/1.0; +https://www.anthropic.com/crawlers) | Mozilla/5.0 (compatible; Claude-SearchBot/1.0; +https://www.anthropic.com/crawlers) |
| Operator | Anthropic | Anthropic | Anthropic |
| Purpose | Training data collection and index building | Real-time page fetching during user conversations | Web search result retrieval for search-grounded responses |
| Robots.txt Compliance | Yes | Yes | Yes |
| Crawl Pattern Notes | Session-cached crawling pattern. Fetches pages and caches content for the duration of a conversation session. Less aggressive than GPTBot. | Triggered when users share URLs or Claude performs web lookups during conversations. Low volume, highly targeted. | Retrieves search results to ground Claude's responses in current web data. Operates similarly to OAI-SearchBot. |
Anthropic's crawler architecture mirrors OpenAI's three-tier approach. The key behavioral difference is session caching: ClaudeBot tends to cache page content within a conversation session rather than re-fetching on every reference. Lee (2026) confirmed that Claude performs live page fetches (like ChatGPT), placing it in the "real-time retrieval" architectural category rather than the "pre-built index" category used by Perplexity and Gemini.
Perplexity Bots
| Field | PerplexityBot | Perplexity-User |
|---|---|---|
| User Agent Token | PerplexityBot | Perplexity-User |
| Full UA String | Mozilla/5.0 (compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot) | Mozilla/5.0 (compatible; Perplexity-User/1.0; +https://perplexity.ai/perplexitybot) |
| Operator | Perplexity AI | Perplexity AI |
| Purpose | Index building and broad web crawling for search | Real-time page fetching for user-initiated browsing |
| Robots.txt Compliance | Yes (after 2024 compliance controversy) | Yes |
| Crawl Pattern Notes | Builds a pre-indexed corpus. Crawls aggressively and frequently. Pages in the Perplexity index get cited in search results. Re-crawl frequency varies by page authority. | Activated when users share specific URLs or request page analysis. Lower volume. |
Perplexity operates on a pre-built index model. Unlike ChatGPT and Claude, which fetch pages live during search queries, Perplexity retrieves results from its own index. This means crawl frequency directly affects citation freshness. If PerplexityBot has not re-crawled your page recently, Perplexity may cite outdated content or miss your page entirely.
The 2024 robots.txt controversy is worth noting. Early reports indicated PerplexityBot was not consistently honoring robots.txt directives. Perplexity addressed this in late 2024, and current evidence suggests compliance is now reliable. However, if you are blocking PerplexityBot in your robots.txt, verify the block is actually working by checking your server logs.
Google AI Bots
| Field | Google-Extended | GoogleOther |
|---|---|---|
| User Agent Token | Google-Extended | GoogleOther |
| Full UA String | N/A (Google-Extended is a robots.txt control token only; requests arrive as Googlebot) | Mozilla/5.0 (compatible; GoogleOther) |
| Operator | Google | Google |
| Purpose | Controls content use for Gemini and AI Mode training/grounding | General-purpose fetcher for non-search Google products |
| Robots.txt Compliance | Yes | Yes |
| Crawl Pattern Notes | Blocking Google-Extended prevents content from being used in Gemini and Google AI Mode responses. Does NOT affect regular Google Search indexing. Googlebot remains a separate user agent. | Catch-all fetcher for various Google services. Lower priority for AI-specific management but still relevant for content access control. |
Google's approach is different from OpenAI and Anthropic. Instead of separate crawlers for training versus search, Google uses a single control mechanism (Google-Extended) that governs whether your already-indexed content is used for AI features. Your content is crawled by Googlebot regardless. Google-Extended controls whether that content feeds into Gemini and AI Mode. Blocking Google-Extended does not remove you from Google Search results; it only prevents AI feature usage.
This is a critical distinction. Many publishers blocked Google-Extended in 2024 thinking it would reduce Google's use of their content. It does not affect traditional search at all. It only controls AI overlay features.
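As a concrete sketch, opting out of Gemini and AI Mode usage while keeping normal Search indexing takes exactly two directives in robots.txt:

# Remove content from Gemini and AI Mode grounding/training.
# Googlebot is a separate token; Search crawling and ranking are unaffected.
User-agent: Google-Extended
Disallow: /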
Other Major AI Bots
| Bot | User Agent Token | Full UA String | Operator | Purpose | Robots.txt | Notes |
|---|---|---|---|---|---|---|
| Bytespider | Bytespider | Mozilla/5.0 (compatible; Bytespider; spider-feedback@bytedance.com) | ByteDance (TikTok) | Training data for ByteDance AI models, including those powering TikTok search features | Yes | Very aggressive crawler. High request volume. Many publishers block this due to volume concerns. |
| Meta-ExternalAgent | Meta-ExternalAgent | Mozilla/5.0 (compatible; Meta-ExternalAgent/1.0; +https://www.meta.com/en-us/help/1417427585694498/) | Meta | Training data for Meta AI features across Facebook, Instagram, WhatsApp | Yes | Newer bot (emerged 2024). Distinct from FacebookBot, which handles link preview rendering. |
| FacebookBot | FacebookBot | Mozilla/5.0 (compatible; FacebookBot/1.0; +https://www.facebook.com/externalhit_uatext.php) | Meta | Link preview rendering and Open Graph tag retrieval | Yes | NOT an AI training bot. Fetches pages when users share links on Meta platforms. Blocking this breaks social sharing previews. |
| AmazonBot | Amazonbot | Mozilla/5.0 (compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) | Amazon | Powering Alexa answers, Amazon Q, and Amazon search features | Yes | Moderate volume. Focused on commercial and informational content. |
| AppleBot | Applebot | Mozilla/5.0 (Applebot/0.1; +http://www.apple.com/go/applebot) | Apple | Siri, Apple Intelligence, Spotlight search, and Safari suggestions | Yes | Also uses Applebot-Extended token for AI-specific content use (similar to Google-Extended). |
| DuckAssistBot | DuckAssistBot | Mozilla/5.0 (compatible; DuckAssistBot/1.1; +https://duckduckgo.com/duckassistbot) | DuckDuckGo | AI-generated answers in DuckDuckGo search (DuckAssist feature) | Yes | Lower volume. Focused on factual and reference content. |
| Bingbot | bingbot | Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) | Microsoft | Web indexing for Bing Search, which also powers ChatGPT URL discovery | Yes | Not strictly an AI bot, but critical for AI visibility. ChatGPT uses Bing's index for URL discovery. If Bingbot cannot crawl your page, ChatGPT may never find it. |
The Bottom Line: Bingbot deserves special attention. Even though it is a traditional search crawler, it serves as the discovery layer for ChatGPT's search grounding. Lee (2026) found that ChatGPT URL selection correlates with Bing indexing. Block Bingbot and you reduce your chances of appearing in ChatGPT search results, even if you allow all three OpenAI bots.
📊 COMPARISON: TRAINING BOTS VS. SEARCH BOTS
Not all AI bots serve the same purpose. The most important management decision is distinguishing between bots that crawl for model training and bots that crawl to serve real-time search citations. Here is the breakdown:
| Bot | Training | Search/Citation | Recommended Action |
|---|---|---|---|
| GPTBot | Yes | No | Block if you do not want content used for training |
| OAI-SearchBot | No | Yes | Allow for ChatGPT search visibility |
| ChatGPT-User | No | Yes | Allow for live citation in conversations |
| ClaudeBot | Yes | Partial (session cache) | Allow for Claude visibility; block only if training is a concern |
| Claude-User | No | Yes | Allow for live citation |
| Claude-SearchBot | No | Yes | Allow for Claude search visibility |
| PerplexityBot | Yes | Yes (index-based) | Allow for Perplexity visibility; dual-purpose bot |
| Perplexity-User | No | Yes | Allow for live browsing citations |
| Google-Extended | Yes (AI features) | Yes (AI Mode) | Allow for Google AI Mode visibility |
| GoogleOther | Variable | No | Allow unless you have specific concerns |
| Bytespider | Yes | Minimal | Block if volume is a problem or if you do not target TikTok |
| Meta-ExternalAgent | Yes | Emerging | Allow if you want Meta AI visibility |
| AmazonBot | Yes | Yes (Alexa, Amazon Q) | Allow for Amazon ecosystem visibility |
| Applebot | Yes | Yes (Siri, Apple Intelligence) | Allow for Apple ecosystem visibility |
| DuckAssistBot | No | Yes | Allow for DuckDuckGo AI answer visibility |
| bingbot | No (indexing) | Indirect (powers ChatGPT discovery) | Always allow; critical for ChatGPT discoverability |
The Bottom Line: The minimum viable "allow" list for AI search visibility includes: OAI-SearchBot, ChatGPT-User, ClaudeBot, Claude-User, Claude-SearchBot, PerplexityBot, Google-Extended, Applebot, and bingbot. Block any of these and you lose visibility on the corresponding platform.
🔧 ROBOTS.TXT CONFIGURATION REFERENCE
Here is the practical robots.txt configuration for the three most common scenarios. Copy these directly into your robots.txt file.
Scenario 1: Allow Everything (Maximum Visibility)
Do not add any AI-specific blocks. The default state of robots.txt allows all crawlers. If you have existing blocks, remove them.
Scenario 2: Allow Citations, Block Training
This is the most popular configuration for publishers in 2026. It allows real-time search and citation bots while blocking training-only crawlers:
# Block training-only crawlers
User-agent: GPTBot
Disallow: /
User-agent: Bytespider
Disallow: /
# Allow all citation and search bots (default allow)
# OAI-SearchBot, ChatGPT-User, ClaudeBot, Claude-User,
# Claude-SearchBot, PerplexityBot, Perplexity-User,
# Google-Extended, Applebot, DuckAssistBot, bingbot
# all allowed by default (no Disallow rules needed)
Scenario 3: Block All AI Bots
This removes your content from all AI platforms. Only use this if you have a specific business reason:
User-agent: GPTBot
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-User
Disallow: /
User-agent: Claude-SearchBot
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Perplexity-User
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: DuckAssistBot
Disallow: /
Note: Do NOT block Applebot or bingbot in this list unless you also want to lose Siri and Bing search visibility entirely, since those bots serve traditional search functions as well.
For a detailed walkthrough of robots.txt strategies for AI bots, see our Robots.txt for AI Bots Guide.
📈 CRAWL BEHAVIOR PATTERNS BY PLATFORM
Understanding how each platform crawls helps you anticipate traffic patterns and diagnose issues. Here is what we observe in server logs across client sites:
| Platform | Crawl Architecture | Crawl Volume | Crawl Freshness | Discovery Method |
|---|---|---|---|---|
| OpenAI (GPTBot) | Broad autonomous crawl | High (thousands/day for large sites) | Days to weeks between re-crawls | Sitemap + link following |
| OpenAI (OAI-SearchBot) | Query-triggered fetch | Moderate | Real-time (fetches on query) | Bing index + query relevance |
| OpenAI (ChatGPT-User) | User-triggered fetch | Low | Real-time | User-provided URLs + live search |
| Anthropic (ClaudeBot) | Session-cached crawl | Moderate | Session-level caching | Link following + direct fetches |
| Perplexity (PerplexityBot) | Index-building crawl | High | Hours to days | Aggressive link following + sitemaps |
| Google (Google-Extended) | Piggybacks on Googlebot | N/A (uses Googlebot data) | Matches Googlebot freshness | Google's standard crawl infrastructure |
| ByteDance (Bytespider) | Broad autonomous crawl | Very high | Days | Aggressive link following |
| Meta (Meta-ExternalAgent) | Broad autonomous crawl | Moderate | Days to weeks | Link following |
| Amazon (AmazonBot) | Selective crawl | Low to moderate | Weeks | Sitemap + selective link following |
| Apple (Applebot) | Selective crawl | Low to moderate | Days to weeks | Sitemap + link following |
Lee (2026) identified the architectural split between "live fetch" platforms (ChatGPT, Claude) and "pre-built index" platforms (Perplexity, Gemini). This distinction explains why freshness management strategies differ by platform. For live-fetch platforms, keeping your server fast and available matters most. For index-based platforms, crawl frequency and sitemap accuracy drive freshness.
The Bottom Line: If PerplexityBot stops crawling a page, that page goes stale in Perplexity's index and may drop from results. Monitor crawl frequency in your server logs and investigate any sudden drops. Our AI Visibility Service includes automated crawl monitoring for all major AI bots.
🛡️ IDENTIFYING AI BOTS IN YOUR SERVER LOGS
Knowing the user agent strings is only useful if you can find them in your logs. Here is a quick grep pattern for the most common web server log formats:
Apache/Nginx combined log format:
grep -E "(GPTBot|OAI-SearchBot|ChatGPT-User|ClaudeBot|Claude-User|Claude-SearchBot|PerplexityBot|Perplexity-User|Google-Extended|Bytespider|Meta-ExternalAgent|Amazonbot|Applebot|DuckAssistBot)" access.log
Key things to verify in your logs:
- IP address validation. User agent strings can be spoofed. OpenAI, Google, and most major operators publish their IP ranges. Cross-reference the source IP against published ranges to confirm authenticity.
- Crawl frequency tracking. Count requests per bot per day. Sudden drops indicate robots.txt changes, DNS issues, or server blocks.
- Status code monitoring. AI bots getting 403, 429, or 5xx responses are being blocked, rate-limited, or encountering errors. Each is a lost opportunity.
- Path analysis. Which pages do AI bots visit most? This reveals what content AI platforms consider most valuable from your site.
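The checks above can be scripted rather than run as one-off greps. Below is a minimal sketch in Python that tallies requests and status codes per bot from a combined-format access log. The regex, the token subset, and the log path in the usage comment are assumptions to adapt to your own environment, not a definitive parser.

```python
import re
from collections import Counter

# AI bot tokens to match (a subset -- extend from the reference table above).
BOT_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "Claude-User",
    "Claude-SearchBot", "PerplexityBot", "Perplexity-User", "Bytespider",
    "Meta-ExternalAgent", "Amazonbot", "Applebot", "DuckAssistBot", "bingbot",
]

# Combined log format: ... "METHOD /path HTTP/x.x" status size "referer" "user-agent"
LOG_RE = re.compile(
    r'"[A-Z]+ (?P<path>\S+) HTTP/[^"]*" (?P<status>\d{3}) .*"(?P<ua>[^"]*)"$'
)

def summarize(lines):
    """Count requests per (bot token, HTTP status) across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        m = LOG_RE.search(line.rstrip("\n"))
        if not m:
            continue
        for token in BOT_TOKENS:
            if token in m.group("ua"):
                counts[(token, m.group("status"))] += 1
                break  # first matching token wins
    return counts

# Usage sketch:
#   with open("access.log") as f:
#       for (bot, status), n in sorted(summarize(f).items()):
#           print(f"{bot:20s} {status} {n}")
```

Running this daily and diffing the counts catches both sudden crawl drops and spikes in 403/429 responses.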
For a complete walkthrough of server log analysis for AI bots, including parsing scripts and dashboards, see our Track AI Bots Guide.
🔍 COMMON MISTAKES AND MISCONFIGURATIONS
These are the most frequent errors we see in AI bot robots.txt management:
Typos in user agent strings. The Robots Exclusion Protocol (RFC 9309) specifies case-insensitive user-agent matching, but not every crawler follows the spec. Use the exact capitalization shown in this reference to be safe: GPTBot is correct; gptbot, GptBot, and GPTBOT may not match depending on how a given crawler parses your file.
Blocking the wrong OpenAI bot. Blocking GPTBot when you meant to block all OpenAI crawling still leaves OAI-SearchBot and ChatGPT-User active. Conversely, blocking ChatGPT-User when you only wanted to block training still leaves GPTBot running.
Confusing FacebookBot with Meta-ExternalAgent. FacebookBot renders link previews. Meta-ExternalAgent collects AI training data. Blocking FacebookBot breaks your social sharing cards. Blocking Meta-ExternalAgent prevents AI training. They serve completely different functions.
Forgetting bingbot is the ChatGPT discovery layer. You can allow all three OpenAI bots and still get zero ChatGPT citations if bingbot cannot index your pages. ChatGPT uses Bing's index for URL discovery.
Not blocking Bytespider when you should. Bytespider is the highest-volume AI crawler on most sites. If server load from AI crawlers is a concern, Bytespider is usually the first one to rate-limit or block.
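Where throttling is preferable to an outright block, a rate limit keyed on the user agent works well. Here is a minimal nginx sketch; the zone name, zone size, and 1 r/s rate are assumptions to tune for your traffic, and requests with an empty key are not limited, so normal visitors are untouched:

# http context: key the limit on Bytespider requests only
map $http_user_agent $ai_crawler_key {
    default        "";
    ~Bytespider    $binary_remote_addr;
}
limit_req_zone $ai_crawler_key zone=bytespider:10m rate=1r/s;

# server/location context:
limit_req zone=bytespider burst=5 nodelay;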
Assuming Google-Extended controls Google Search. Blocking Google-Extended has zero effect on your traditional Google search rankings. It only controls AI overlay features (Gemini, AI Mode). This confusion has led publishers to block Google-Extended and then wonder why their search rankings did not change.
❓ FREQUENTLY ASKED QUESTIONS
How many AI bot user agents are active in 2026? At least 16 distinct AI-related user agent tokens are actively crawling the web as of March 2026, operated by 10 different companies. This number continues to grow as new AI search products launch and existing operators add specialized crawler variants. The complete list is maintained in the reference table above.
Can AI bots spoof their user agent strings? Legitimate AI bots from major operators (OpenAI, Anthropic, Google, etc.) use consistent, verifiable user agent strings and publish their IP address ranges. However, malicious actors can spoof any user agent string. Always cross-reference the source IP address against the operator's published IP ranges to verify authenticity. If a request claims to be GPTBot but comes from an unrecognized IP, it is not a real OpenAI crawler.
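That cross-reference check is easy to automate. A minimal sketch using Python's ipaddress module follows; the CIDR ranges below are placeholders from the RFC 5737 TEST-NET blocks, not real operator ranges, so substitute the current lists from each operator's official documentation before relying on this.

```python
import ipaddress

# Operator CIDR ranges -- PLACEHOLDERS ONLY (RFC 5737 TEST-NET blocks).
# Replace with the ranges each operator actually publishes.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "ClaudeBot": ["198.51.100.0/24"],
}

def is_authentic(bot: str, ip: str) -> bool:
    """True if ip falls inside one of the operator's published CIDR ranges for bot."""
    addr = ipaddress.ip_address(ip)
    return any(
        addr in ipaddress.ip_network(cidr)
        for cidr in PUBLISHED_RANGES.get(bot, [])
    )
```

A request claiming to be GPTBot from an IP outside the published ranges should be treated as an impostor and blocked at the firewall, not in robots.txt.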
What happens if I block all AI bots? Your content disappears from AI-powered search results across all platforms. Users asking ChatGPT, Perplexity, Claude, or Google AI Mode questions in your topic area will see competitor content instead of yours. Aggarwal et al. (2024) showed that generative engine visibility is a distinct channel from traditional search, so blocking AI bots does not affect your Google Search rankings, but it eliminates a growing traffic and authority channel.
Should I allow AI training bots or only citation bots? This depends on your content strategy. Most publishers in 2026 adopt a "citations yes, training maybe" approach: they allow all search and citation bots (OAI-SearchBot, ChatGPT-User, ClaudeBot, Claude-SearchBot, PerplexityBot, Google-Extended) while selectively blocking or allowing training bots (GPTBot, Bytespider). The recommended starting point is to allow everything and only block specific training bots if you have a concrete reason.
How often should I update my AI bot robots.txt rules? Review your AI bot configuration quarterly. New bots emerge regularly, and existing operators update their user agent strings. Check this reference page and our Robots.txt for AI Bots Guide for the latest additions. Also monitor your server logs monthly for unfamiliar AI-related user agent strings that may indicate new crawlers not yet widely documented. Run your site through the AI Visibility Quick Check to test your current configuration.
📚 REFERENCES
- Lee, A. (2026). "Query Intent, Not Google Rank: What Best Predicts AI Citation Behavior." Preprint v5. DOI
- Aggarwal, P., Murahari, V., Rajpurohit, T., Kalyan, A., Narasimhan, K., & Deshpande, A. (2024). "GEO: Generative Engine Optimization." KDD 2024. DOI
- Kadel, J., See, A., Sinha, R., & Fischer, M. (2024). "BOTracle: A Framework for Discriminating Bots and Humans." arXiv. DOI