There are now more than 15 distinct AI bot user agents crawling the web, each with different strings, different purposes, and different robots.txt compliance behaviors. If you do not know every one of them by name, you are almost certainly blocking traffic you want or allowing access you do not.
The AI crawler landscape changed dramatically between 2024 and 2026. What was once a handful of experimental bots has become a full ecosystem of specialized crawlers, each operated by a major AI company, each serving a distinct function. Some crawl for model training data. Some crawl for real-time search answers. Some do both. And the user agent strings they send are the only way to tell them apart.
This reference page catalogs every known AI bot user agent active in 2026. For each bot, you get the exact user agent string, the operator, the stated purpose, whether it respects robots.txt, and practical notes on crawl behavior. Bookmark this page. You will need it.
The Bottom Line: Managing AI crawl access requires knowing the exact user agent string for each bot. A single typo in your robots.txt means the rule does nothing. This reference gives you every string you need, verified against official documentation as of March 2026.
If you want to see which of these bots are already hitting your site, our Track AI Bots Guide walks through server log analysis step by step. For robots.txt configuration, see our Robots.txt for AI Bots Guide.
🤖 WHY AI BOT USER AGENTS MATTER NOW
Traditional web crawlers were simple. Googlebot, Bingbot, a handful of others. You knew them. You configured your robots.txt once, and the job was done for years.
AI bots broke that model. Research from Lee (2026) found that ChatGPT and Claude perform live page fetches for search queries, while Perplexity and Gemini rely on pre-built indices. That architectural difference means blocking one bot has very different consequences than blocking another. Block GPTBot and you lose training inclusion but keep ChatGPT search citations (via ChatGPT-User). Block ChatGPT-User and you lose real-time citations entirely. The distinction matters.
Aggarwal et al. (2024) demonstrated that targeted optimization for generative engines can boost content visibility by up to 40%. But optimization is meaningless if the crawler cannot reach your pages. Crawl access is the foundation layer, and user agent identification is how you control it.
The proliferation of AI bots also creates a measurement challenge. Kadel et al. (2024) found that modern bots are increasingly difficult to distinguish from real users when relying on behavioral signals alone, making user agent string verification even more critical as a first-pass identification method.
The Bottom Line: You cannot optimize for AI platforms you are accidentally blocking. And you cannot block AI platforms you do not know exist. This list solves both problems.
📋 THE COMPLETE AI BOT USER AGENTS TABLE (2026)
This is the core reference. Every AI bot user agent known to be active as of March 2026, organized by operator.
OpenAI Bots
| Field | GPTBot | OAI-SearchBot | ChatGPT-User |
|---|---|---|---|
| User Agent Token | GPTBot | OAI-SearchBot | ChatGPT-User |
| Full UA String | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot) | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot) | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ChatGPT-User/1.0; +https://openai.com/bot) |
| Operator | OpenAI | OpenAI | OpenAI |
| Purpose | Training data collection and general web crawling | Web search for AI-generated search results | Real-time page fetching when users request URLs or ChatGPT performs live search |
| Robots.txt Compliance | Yes | Yes | Yes |
| Crawl Pattern Notes | Crawls broadly. High volume. Respects crawl-delay. Distinct from search/citation bots. Blocking this does NOT block ChatGPT search citations. | Powers the search results in ChatGPT's search feature. Fetches pages for inclusion in search-grounded responses. | Fetches specific URLs during user conversations. Lower volume but targeted. Triggered by user queries in real time. |
OpenAI's three-bot architecture is the most important distinction in AI crawl management. GPTBot is the training crawler. OAI-SearchBot handles search result retrieval. ChatGPT-User handles live browsing during conversations. You can block GPTBot (training) while allowing OAI-SearchBot and ChatGPT-User (citations). This is the recommended configuration for publishers who want AI citation visibility without contributing training data. For a deeper analysis of how these bots interact, see our post on OpenAI's split personality bot architecture.
Anthropic Bots
| Field | ClaudeBot | Claude-User | Claude-SearchBot |
|---|---|---|---|
| User Agent Token | ClaudeBot | Claude-User | Claude-SearchBot |
| Full UA String | Mozilla/5.0 (compatible; ClaudeBot/1.0; +https://www.anthropic.com/crawlers) | Mozilla/5.0 (compatible; Claude-User/1.0; +https://www.anthropic.com/crawlers) | Mozilla/5.0 (compatible; Claude-SearchBot/1.0; +https://www.anthropic.com/crawlers) |
| Operator | Anthropic | Anthropic | Anthropic |
| Purpose | Training data collection and index building | Real-time page fetching during user conversations | Web search result retrieval for search-grounded responses |
| Robots.txt Compliance | Yes | Yes | Yes |
| Crawl Pattern Notes | Session-cached crawling pattern. Fetches pages and caches content for the duration of a conversation session. Less aggressive than GPTBot. | Triggered when users share URLs or Claude performs web lookups during conversations. Low volume, highly targeted. | Retrieves search results to ground Claude's responses in current web data. Operates similarly to OAI-SearchBot. |
Anthropic's crawler architecture mirrors OpenAI's three-tier approach. The key behavioral difference is session caching: ClaudeBot tends to cache page content within a conversation session rather than re-fetching on every reference. Lee (2026) confirmed that Claude performs live page fetches (like ChatGPT), placing it in the "real-time retrieval" architectural category rather than the "pre-built index" category used by Perplexity and Gemini.
Perplexity Bots
| Field | PerplexityBot | Perplexity-User |
|---|---|---|
| User Agent Token | PerplexityBot | Perplexity-User |
| Full UA String | Mozilla/5.0 (compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot) | Mozilla/5.0 (compatible; Perplexity-User/1.0; +https://perplexity.ai/perplexitybot) |
| Operator | Perplexity AI | Perplexity AI |
| Purpose | Index building and broad web crawling for search | Real-time page fetching for user-initiated browsing |
| Robots.txt Compliance | Yes (after 2024 compliance controversy) | Yes |
| Crawl Pattern Notes | Builds a pre-indexed corpus. Crawls aggressively and frequently. Pages in the Perplexity index get cited in search results. Re-crawl frequency varies by page authority. | Activated when users share specific URLs or request page analysis. Lower volume. |
Perplexity operates on a pre-built index model. Unlike ChatGPT and Claude, which fetch pages live during search queries, Perplexity retrieves results from its own index. This means crawl frequency directly affects citation freshness. If PerplexityBot has not re-crawled your page recently, Perplexity may cite outdated content or miss your page entirely.
The 2024 robots.txt controversy is worth noting. Early reports indicated PerplexityBot was not consistently honoring robots.txt directives. Perplexity addressed this in late 2024, and current evidence suggests compliance is now reliable. However, if you are blocking PerplexityBot in your robots.txt, verify the block is actually working by checking your server logs.
Google AI Bots
| Field | Google-Extended | GoogleOther |
|---|---|---|
| User Agent Token | Google-Extended | GoogleOther |
| Full UA String | N/A (Google-Extended is a robots.txt control token only; requests arrive as Googlebot) | Mozilla/5.0 (compatible; GoogleOther) |
| Operator | Google | Google |
| Purpose | Controls content use for Gemini and AI Mode training/grounding | General-purpose fetcher for non-search Google products |
| Robots.txt Compliance | Yes | Yes |
| Crawl Pattern Notes | Blocking Google-Extended prevents content from being used in Gemini and Google AI Mode responses. Does NOT affect regular Google Search indexing. Googlebot remains a separate user agent. | Catch-all fetcher for various Google services. Lower priority for AI-specific management but still relevant for content access control. |
Google's approach is different from OpenAI and Anthropic. Instead of separate crawlers for training versus search, Google uses a single control mechanism (Google-Extended) that governs whether your already-indexed content is used for AI features. Your content is crawled by Googlebot regardless. Google-Extended controls whether that content feeds into Gemini and AI Mode. Blocking Google-Extended does not remove you from Google Search results; it only prevents AI feature usage.
This is a critical distinction. Many publishers blocked Google-Extended in 2024 thinking it would reduce Google's use of their content. It does not affect traditional search at all. It only controls AI overlay features.
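As a concrete sketch, opting out of Gemini and AI Mode usage while keeping normal Search indexing takes exactly two directives in robots.txt:

# Remove content from Gemini and AI Mode grounding/training.
# Googlebot is a separate token; Search crawling and ranking are unaffected.
User-agent: Google-Extended
Disallow: /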
Other Major AI Bots
| Bot | User Agent Token | Full UA String | Operator | Purpose | Robots.txt | Notes |
|---|---|---|---|---|---|---|
| Bytespider | Bytespider | Mozilla/5.0 (compatible; Bytespider; spider-feedback@bytedance.com) | ByteDance (TikTok) | Training data for ByteDance AI models, including those powering TikTok search features | Yes | Very aggressive crawler. High request volume. Many publishers block this due to volume concerns. |
| Meta-ExternalAgent | Meta-ExternalAgent | Mozilla/5.0 (compatible; Meta-ExternalAgent/1.0; +https://www.meta.com/en-us/help/1417427585694498/) | Meta | Training data for Meta AI features across Facebook, Instagram, WhatsApp | Yes | Newer bot (emerged 2024). Distinct from FacebookBot, which handles link preview rendering. |
| FacebookBot | FacebookBot | Mozilla/5.0 (compatible; FacebookBot/1.0; +https://www.facebook.com/externalhit_uatext.php) | Meta | Link preview rendering and Open Graph tag retrieval | Yes | NOT an AI training bot. Fetches pages when users share links on Meta platforms. Blocking this breaks social sharing previews. |
| AmazonBot | Amazonbot | Mozilla/5.0 (compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) | Amazon | Powering Alexa answers, Amazon Q, and Amazon search features | Yes | Moderate volume. Focused on commercial and informational content. |
| AppleBot | Applebot | Mozilla/5.0 (Applebot/0.1; +http://www.apple.com/go/applebot) | Apple | Siri, Apple Intelligence, Spotlight search, and Safari suggestions | Yes | Also uses Applebot-Extended token for AI-specific content use (similar to Google-Extended). |
| DuckAssistBot | DuckAssistBot | Mozilla/5.0 (compatible; DuckAssistBot/1.1; +https://duckduckgo.com/duckassistbot) | DuckDuckGo | AI-generated answers in DuckDuckGo search (DuckAssist feature) | Yes | Lower volume. Focused on factual and reference content. |
| Bingbot | bingbot | Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) | Microsoft | Web indexing for Bing Search, which also powers ChatGPT URL discovery | Yes | Not strictly an AI bot, but critical for AI visibility. ChatGPT uses Bing's index for URL discovery. If Bingbot cannot crawl your page, ChatGPT may never find it. |
The Bottom Line: Bingbot deserves special attention. Even though it is a traditional search crawler, it serves as the discovery layer for ChatGPT's search grounding. Lee (2026) found that ChatGPT URL selection correlates with Bing indexing. Block Bingbot and you reduce your chances of appearing in ChatGPT search results, even if you allow all three OpenAI bots.
📊 COMPARISON: TRAINING BOTS VS. SEARCH BOTS
Not all AI bots serve the same purpose. The most important management decision is distinguishing between bots that crawl for model training and bots that crawl to serve real-time search citations. Here is the breakdown:
| Bot | Training | Search/Citation | Recommended Action |
|---|---|---|---|
| GPTBot | Yes | No | Block if you do not want content used for training |
| OAI-SearchBot | No | Yes | Allow for ChatGPT search visibility |
| ChatGPT-User | No | Yes | Allow for live citation in conversations |
| ClaudeBot | Yes | Partial (session cache) | Allow for Claude visibility; block only if training is a concern |
| Claude-User | No | Yes | Allow for live citation |
| Claude-SearchBot | No | Yes | Allow for Claude search visibility |
| PerplexityBot | Yes | Yes (index-based) | Allow for Perplexity visibility; dual-purpose bot |
| Perplexity-User | No | Yes | Allow for live browsing citations |
| Google-Extended | Yes (AI features) | Yes (AI Mode) | Allow for Google AI Mode visibility |
| GoogleOther | Variable | No | Allow unless you have specific concerns |
| Bytespider | Yes | Minimal | Block if volume is a problem or if you do not target TikTok |
| Meta-ExternalAgent | Yes | Emerging | Allow if you want Meta AI visibility |
| AmazonBot | Yes | Yes (Alexa, Amazon Q) | Allow for Amazon ecosystem visibility |
| Applebot | Yes | Yes (Siri, Apple Intelligence) | Allow for Apple ecosystem visibility |
| DuckAssistBot | No | Yes | Allow for DuckDuckGo AI answer visibility |
| bingbot | No (indexing) | Indirect (powers ChatGPT discovery) | Always allow; critical for ChatGPT discoverability |
The Bottom Line: The minimum viable "allow" list for AI search visibility includes: OAI-SearchBot, ChatGPT-User, ClaudeBot, Claude-User, Claude-SearchBot, PerplexityBot, Google-Extended, Applebot, and bingbot. Block any of these and you lose visibility on the corresponding platform.
🔧 ROBOTS.TXT CONFIGURATION REFERENCE
Here is the practical robots.txt configuration for the three most common scenarios. Copy these directly into your robots.txt file.
Scenario 1: Allow Everything (Maximum Visibility)
Do not add any AI-specific blocks. The default state of robots.txt allows all crawlers. If you have existing blocks, remove them.
Scenario 2: Allow Citations, Block Training
This is the most popular configuration for publishers in 2026. It allows real-time search and citation bots while blocking training-only crawlers:
# Block training-only crawlers
User-agent: GPTBot
Disallow: /
User-agent: Bytespider
Disallow: /
# Allow all citation and search bots (default allow)
# OAI-SearchBot, ChatGPT-User, ClaudeBot, Claude-User,
# Claude-SearchBot, PerplexityBot, Perplexity-User,
# Google-Extended, Applebot, DuckAssistBot, bingbot
# all allowed by default (no Disallow rules needed)
Scenario 3: Block All AI Bots
This removes your content from all AI platforms. Only use this if you have a specific business reason:
User-agent: GPTBot
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-User
Disallow: /
User-agent: Claude-SearchBot
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Perplexity-User
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: DuckAssistBot
Disallow: /
Note: Do NOT block Applebot or bingbot in this list unless you also want to lose Siri and Bing search visibility entirely, since those bots serve traditional search functions as well.
For a detailed walkthrough of robots.txt strategies for AI bots, see our Robots.txt for AI Bots Guide.
📈 CRAWL BEHAVIOR PATTERNS BY PLATFORM
Understanding how each platform crawls helps you anticipate traffic patterns and diagnose issues. Here is what we observe in server logs across client sites:
| Platform | Crawl Architecture | Crawl Volume | Crawl Freshness | Discovery Method |
|---|---|---|---|---|
| OpenAI (GPTBot) | Broad autonomous crawl | High (thousands/day for large sites) | Days to weeks between re-crawls | Sitemap + link following |
| OpenAI (OAI-SearchBot) | Query-triggered fetch | Moderate | Real-time (fetches on query) | Bing index + query relevance |
| OpenAI (ChatGPT-User) | User-triggered fetch | Low | Real-time | User-provided URLs + live search |
| Anthropic (ClaudeBot) | Session-cached crawl | Moderate | Session-level caching | Link following + direct fetches |
| Perplexity (PerplexityBot) | Index-building crawl | High | Hours to days | Aggressive link following + sitemaps |
| Google (Google-Extended) | Piggybacks on Googlebot | N/A (uses Googlebot data) | Matches Googlebot freshness | Google's standard crawl infrastructure |
| ByteDance (Bytespider) | Broad autonomous crawl | Very high | Days | Aggressive link following |
| Meta (Meta-ExternalAgent) | Broad autonomous crawl | Moderate | Days to weeks | Link following |
| Amazon (AmazonBot) | Selective crawl | Low to moderate | Weeks | Sitemap + selective link following |
| Apple (Applebot) | Selective crawl | Low to moderate | Days to weeks | Sitemap + link following |
Lee (2026) identified the architectural split between "live fetch" platforms (ChatGPT, Claude) and "pre-built index" platforms (Perplexity, Gemini). This distinction explains why freshness management strategies differ by platform. For live-fetch platforms, keeping your server fast and available matters most. For index-based platforms, crawl frequency and sitemap accuracy drive freshness.
The Bottom Line: If PerplexityBot stops crawling a page, that page goes stale in Perplexity's index and may drop from results. Monitor crawl frequency in your server logs and investigate any sudden drops. Our AI Visibility Service includes automated crawl monitoring for all major AI bots.
🛡️ IDENTIFYING AI BOTS IN YOUR SERVER LOGS
Knowing the user agent strings is only useful if you can find them in your logs. Here is a quick grep pattern for the most common web server log formats:
Apache/Nginx combined log format:
grep -E "(GPTBot|OAI-SearchBot|ChatGPT-User|ClaudeBot|Claude-User|Claude-SearchBot|PerplexityBot|Perplexity-User|Google-Extended|Bytespider|Meta-ExternalAgent|Amazonbot|Applebot|DuckAssistBot)" access.log
Key things to verify in your logs:
- IP address validation. User agent strings can be spoofed. OpenAI, Google, and most major operators publish their IP ranges. Cross-reference the source IP against published ranges to confirm authenticity.
- Crawl frequency tracking. Count requests per bot per day. Sudden drops indicate robots.txt changes, DNS issues, or server blocks.
- Status code monitoring. AI bots getting 403, 429, or 5xx responses are being blocked, rate-limited, or encountering errors. Each is a lost opportunity.
- Path analysis. Which pages do AI bots visit most? This reveals what content AI platforms consider most valuable from your site.
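The checks above can be scripted rather than run as one-off greps. Below is a minimal sketch in Python that tallies requests and status codes per bot from a combined-format access log. The regex, the token subset, and the log path in the usage comment are assumptions to adapt to your own environment, not a definitive parser.

```python
import re
from collections import Counter

# AI bot tokens to match (a subset -- extend from the reference table above).
BOT_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "Claude-User",
    "Claude-SearchBot", "PerplexityBot", "Perplexity-User", "Bytespider",
    "Meta-ExternalAgent", "Amazonbot", "Applebot", "DuckAssistBot", "bingbot",
]

# Combined log format: ... "METHOD /path HTTP/x.x" status size "referer" "user-agent"
LOG_RE = re.compile(
    r'"[A-Z]+ (?P<path>\S+) HTTP/[^"]*" (?P<status>\d{3}) .*"(?P<ua>[^"]*)"$'
)

def summarize(lines):
    """Count requests per (bot token, HTTP status) across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        m = LOG_RE.search(line.rstrip("\n"))
        if not m:
            continue
        for token in BOT_TOKENS:
            if token in m.group("ua"):
                counts[(token, m.group("status"))] += 1
                break  # first matching token wins
    return counts

# Usage sketch:
#   with open("access.log") as f:
#       for (bot, status), n in sorted(summarize(f).items()):
#           print(f"{bot:20s} {status} {n}")
```

Running this daily and diffing the counts catches both sudden crawl drops and spikes in 403/429 responses.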
For a complete walkthrough of server log analysis for AI bots, including parsing scripts and dashboards, see our Track AI Bots Guide.
🔍 COMMON MISTAKES AND MISCONFIGURATIONS
These are the most frequent errors we see in AI bot robots.txt management:
Typos in user agent strings. The Robots Exclusion Protocol (RFC 9309) specifies case-insensitive user-agent matching, but not every crawler follows the spec. Use the exact capitalization shown in this reference to be safe: GPTBot is correct; gptbot, GptBot, and GPTBOT may not match depending on how a given crawler parses your file.
Blocking the wrong OpenAI bot. Blocking GPTBot when you meant to block all OpenAI crawling still leaves OAI-SearchBot and ChatGPT-User active. Conversely, blocking ChatGPT-User when you only wanted to block training still leaves GPTBot running.
Confusing FacebookBot with Meta-ExternalAgent. FacebookBot renders link previews. Meta-ExternalAgent collects AI training data. Blocking FacebookBot breaks your social sharing cards. Blocking Meta-ExternalAgent prevents AI training. They serve completely different functions.
Forgetting bingbot is the ChatGPT discovery layer. You can allow all three OpenAI bots and still get zero ChatGPT citations if bingbot cannot index your pages. ChatGPT uses Bing's index for URL discovery.
Not blocking Bytespider when you should. Bytespider is the highest-volume AI crawler on most sites. If server load from AI crawlers is a concern, Bytespider is usually the first one to rate-limit or block.
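Where throttling is preferable to an outright block, a rate limit keyed on the user agent works well. Here is a minimal nginx sketch; the zone name, zone size, and 1 r/s rate are assumptions to tune for your traffic, and requests with an empty key are not limited, so normal visitors are untouched:

# http context: key the limit on Bytespider requests only
map $http_user_agent $ai_crawler_key {
    default        "";
    ~Bytespider    $binary_remote_addr;
}
limit_req_zone $ai_crawler_key zone=bytespider:10m rate=1r/s;

# server/location context:
limit_req zone=bytespider burst=5 nodelay;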
Assuming Google-Extended controls Google Search. Blocking Google-Extended has zero effect on your traditional Google search rankings. It only controls AI overlay features (Gemini, AI Mode). This confusion has led publishers to block Google-Extended and then wonder why their search rankings did not change.
❓ FREQUENTLY ASKED QUESTIONS
How many AI bot user agents are active in 2026? At least 16 distinct AI-related user agent tokens are actively crawling the web as of March 2026, operated by 10 different companies. This number continues to grow as new AI search products launch and existing operators add specialized crawler variants. The complete list is maintained in the reference table above.
Can AI bots spoof their user agent strings? Legitimate AI bots from major operators (OpenAI, Anthropic, Google, etc.) use consistent, verifiable user agent strings and publish their IP address ranges. However, malicious actors can spoof any user agent string. Always cross-reference the source IP address against the operator's published IP ranges to verify authenticity. If a request claims to be GPTBot but comes from an unrecognized IP, it is not a real OpenAI crawler.
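That cross-reference check is easy to automate. A minimal sketch using Python's ipaddress module follows; the CIDR ranges below are placeholders from the RFC 5737 TEST-NET blocks, not real operator ranges, so substitute the current lists from each operator's official documentation before relying on this.

```python
import ipaddress

# Operator CIDR ranges -- PLACEHOLDERS ONLY (RFC 5737 TEST-NET blocks).
# Replace with the ranges each operator actually publishes.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "ClaudeBot": ["198.51.100.0/24"],
}

def is_authentic(bot: str, ip: str) -> bool:
    """True if ip falls inside one of the operator's published CIDR ranges for bot."""
    addr = ipaddress.ip_address(ip)
    return any(
        addr in ipaddress.ip_network(cidr)
        for cidr in PUBLISHED_RANGES.get(bot, [])
    )
```

A request claiming to be GPTBot from an IP outside the published ranges should be treated as an impostor and blocked at the firewall, not in robots.txt.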
What happens if I block all AI bots? Your content disappears from AI-powered search results across all platforms. Users asking ChatGPT, Perplexity, Claude, or Google AI Mode questions in your topic area will see competitor content instead of yours. Aggarwal et al. (2024) showed that generative engine visibility is a distinct channel from traditional search, so blocking AI bots does not affect your Google Search rankings, but it eliminates a growing traffic and authority channel.
Should I allow AI training bots or only citation bots? This depends on your content strategy. Most publishers in 2026 adopt a "citations yes, training maybe" approach: they allow all search and citation bots (OAI-SearchBot, ChatGPT-User, ClaudeBot, Claude-SearchBot, PerplexityBot, Google-Extended) while selectively blocking or allowing training bots (GPTBot, Bytespider). The recommended starting point is to allow everything and only block specific training bots if you have a concrete reason.
How often should I update my AI bot robots.txt rules? Review your AI bot configuration quarterly. New bots emerge regularly, and existing operators update their user agent strings. Check this reference page and our Robots.txt for AI Bots Guide for the latest additions. Also monitor your server logs monthly for unfamiliar AI-related user agent strings that may indicate new crawlers not yet widely documented. Run your site through the AI Visibility Quick Check to test your current configuration.
📚 REFERENCES
- Lee, A. (2026). "Query Intent, Not Google Rank: What Best Predicts AI Citation Behavior." Preprint v5. DOI
- Aggarwal, P., Murahari, V., Rajpurohit, T., Kalyan, A., Narasimhan, K., & Deshpande, A. (2024). "GEO: Generative Engine Optimization." KDD 2024. DOI
- Kadel, J., See, A., Sinha, R., & Fischer, M. (2024). "BOTracle: A Framework for Discriminating Bots and Humans." arXiv. DOI