
AI crawlers and user agents: the complete reference

Every major AI crawler, the exact user-agent token it sends, what it is for, and whether to allow or block it in robots.txt. If an AI engine cannot crawl you, it cannot cite you — so this is where AI visibility starts.

The short version

AI companies run web crawlers (like Googlebot, but for AI) that fetch your pages for three different jobs: training future models, building a search index for live answers, and fetching a page on demand when a user asks about it. You control each one by its user-agent token in robots.txt. Block the wrong bot and you quietly vanish from an answer engine; leave them open and your pages stay fetchable.

The crawlers, at a glance

User-agent | Operator | Purpose | What to do
GPTBot | OpenAI | Model training | Allow to be in future models; block to opt out of training. Does not affect ChatGPT search.
OAI-SearchBot | OpenAI | ChatGPT search index | Allow. This is how you appear in ChatGPT's search answers.
ChatGPT-User | OpenAI | Live user fetch | Allow. Fetches your page when a user asks about it in ChatGPT.
ClaudeBot | Anthropic | Index / training | Allow to be readable and citable by Claude.
Claude-User | Anthropic | Live user fetch | Allow. Fetches on a Claude user's behalf.
PerplexityBot | Perplexity | Search index | Allow. Required to be cited in Perplexity answers.
Perplexity-User | Perplexity | Live user fetch | Allow. User-initiated fetch.
Google-Extended | Google | Gemini / AI training | Controls Gemini & AI training only. Does NOT affect Google Search indexing.
Applebot-Extended | Apple | Apple Intelligence training | Opt-out token for Apple AI training; Applebot still crawls for Siri/Spotlight.
CCBot | Common Crawl | Open dataset (feeds many AIs) | Allow to be in the open corpus; block to stay out of it.
Bytespider | ByteDance | Training (Doubao / TikTok) | Often blocked. Known for aggressive crawling.
Amazonbot | Amazon | Alexa / search / AI | Allow for Alexa answers and Amazon's AI features.
Meta-ExternalAgent | Meta | AI training / fetch | Controls Meta AI's training and on-demand fetching.

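Spell these tokens exactly as shown in your robots.txt rules, hyphens included. Matching is case-insensitive under RFC 9309, but a misspelled token matches no crawler, so the rule silently does nothing.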
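One way to catch a misspelled token before it ships is to compare every User-agent value in your file against the tokens in the table above. The following is a minimal Python sketch, not an official validator: the function name `find_suspect_tokens`, the `KNOWN_TOKENS` list (copied from the table), and the 0.8 similarity cutoff are illustrative assumptions.

```python
import difflib

# Tokens copied from the table above (assumption: extend this list yourself
# as operators introduce new crawlers).
KNOWN_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "Claude-User",
    "PerplexityBot", "Perplexity-User", "Google-Extended", "Applebot-Extended",
    "CCBot", "Bytespider", "Amazonbot", "Meta-ExternalAgent",
]


def find_suspect_tokens(robots_txt: str, cutoff: float = 0.8):
    """Return (line_number, token, suggestion) for User-agent values that
    are not known tokens but closely resemble one (likely typos)."""
    known = {token.lower(): token for token in KNOWN_TOKENS}
    suspects = []
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()      # drop trailing comments
        field, sep, value = line.partition(":")
        if not sep or field.strip().lower() != "user-agent":
            continue
        token = value.strip()
        # Exact (case-insensitive) matches and the wildcard are fine.
        if token == "*" or token.lower() in known:
            continue
        close = difflib.get_close_matches(
            token.lower(), list(known), n=1, cutoff=cutoff
        )
        if close:
            suspects.append((number, token, known[close[0]]))
    return suspects


sample = "User-agent: GTPBot\nDisallow: /\n\nUser-agent: oai-searchbot\nAllow: /\n"
print(find_suspect_tokens(sample))  # [(1, 'GTPBot', 'GPTBot')]
```

The check only flags near misses; an unfamiliar token that resembles nothing in the list is left alone, since it may be a legitimate crawler this page does not cover.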

Search vs training — the distinction that trips everyone up

Blocking a training crawler does not remove you from that company's answers, and vice versa. The classic mistake: blocking GPTBot to "opt out of AI", then wondering why you are not in ChatGPT — when the bot behind ChatGPT's search answers is actually OAI-SearchBot, a separate token. Same with Google: Google-Extended only governs Gemini and AI training and has zero effect on normal Google Search. Decide training and visibility separately.
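To see the distinction concretely, the sketch below parses a robots.txt that blocks only GPTBot and asks whether each OpenAI token may fetch the home page. It uses Python's standard-library `urllib.robotparser`; the helper name `is_allowed` and the `example.com` URL are placeholders, and Python's parser is only one interpretation of the rules, so a real crawler may differ at the edges.

```python
from urllib.robotparser import RobotFileParser

# A file that "opts out of AI" by blocking only the training crawler.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, token: str,
               url: str = "https://example.com/") -> bool:
    """Parse robots.txt text and report whether `token` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(token, url)


training_allowed = is_allowed(ROBOTS_TXT, "GPTBot")
search_allowed = is_allowed(ROBOTS_TXT, "OAI-SearchBot")

print("GPTBot (training):", training_allowed)       # False
print("OAI-SearchBot (search):", search_allowed)    # True
```

The GPTBot rule leaves OAI-SearchBot untouched, because no group in the file names that token.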

A sensible robots.txt starting point

Allow the search and user-fetch bots so you stay citable; optionally opt out of pure training crawlers. Add this to the robots.txt at your site root:

# Allow the AI search + user-fetch bots so you stay citable
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

# Optional: opt out of pure model-training crawlers
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

One stray Disallow is all it takes to disappear from an answer engine — verify the live file, do not assume.
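A quick way to verify is to run the deployed file through a parser and list what each AI token is allowed to do. This is a rough sketch using Python's standard-library `urllib.robotparser`; the helper names (`parse_robots`, `load_live`, `access_report`) and the token list are illustrative choices, and `load_live` makes a real network request, so it is defined here but not called.

```python
from urllib.robotparser import RobotFileParser

# Tokens from the starting-point file above.
AI_TOKENS = [
    "OAI-SearchBot", "ChatGPT-User", "PerplexityBot",
    "GPTBot", "Google-Extended", "CCBot",
]


def parse_robots(text: str) -> RobotFileParser:
    """Build a parser from robots.txt text (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def load_live(site: str) -> RobotFileParser:
    """Fetch and parse the deployed robots.txt for `site` (network request)."""
    parser = RobotFileParser(site.rstrip("/") + "/robots.txt")
    parser.read()
    return parser


def access_report(parser: RobotFileParser, tokens=AI_TOKENS, path: str = "/"):
    """Map each token to True (may fetch `path`) or False (blocked)."""
    return {token: parser.can_fetch(token, path) for token in tokens}


STARTING_POINT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
"""

for token, allowed in access_report(parse_robots(STARTING_POINT)).items():
    print(f"{token:16} {'allowed' if allowed else 'BLOCKED'}")
```

To check a deployed site, pass the result of `load_live("https://example.com")` (a placeholder domain) to `access_report` instead of the inline text.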

How common is blocking, really?

Across the 154 leading sites Oraql audited for our 2026 State of AI Search Readiness report, only 5% block a major AI crawler outright — but far more lose visibility a subtler way: content that needs JavaScript to render, which most of these crawlers will not run. Crawler access is necessary but not sufficient; the page also has to be readable once fetched.
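A rough way to approximate what a non-rendering crawler sees is to count the words present in the raw HTML, before any JavaScript runs. The sketch below is a heuristic under stated assumptions, not how any particular crawler works: the names `visible_word_count` and `looks_js_dependent` and the 50-word threshold are arbitrary illustrative choices.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text outside <script>, <style>, and <template> elements."""

    SKIP = {"script", "style", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_word_count(html: str) -> int:
    """Count words in the raw HTML's text, without executing any scripts."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    return len(" ".join(extractor.chunks).split())


def looks_js_dependent(html: str, min_words: int = 50) -> bool:
    """Flag pages whose raw HTML carries fewer than `min_words` words."""
    return visible_word_count(html) < min_words


# An empty app shell: the content only appears after JavaScript runs.
shell = ('<html><head><title>App</title></head><body>'
         '<div id="root"></div><script>render()</script></body></html>')
# A server-rendered page: the content is already in the HTML.
rendered = ('<html><body><h1>Pricing</h1>'
            '<p>Plans start at ten dollars.</p></body></html>')

print(visible_word_count(shell), visible_word_count(rendered))  # 1 6
```

A low count on a page that looks full in the browser suggests the content arrives via JavaScript and may be invisible to crawlers that do not render it.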

Check your site in seconds

Not sure which bots your robots.txt allows? Our free robots.txt AI-crawler checker tells you instantly. Then run the full AI Search Readiness audit — a 0-100 score, an A-F grade, and a prioritized fix list across the seven signals that decide whether AI can read and recommend you.

