AI companies run web crawlers — like Googlebot, but for AI — to fetch your pages for three different jobs: training future models, building a search index for live answers, and fetching a page on demand when a user asks about it. You control each by its user-agent token in robots.txt. Block the wrong bot and you quietly vanish from an answer engine; leave them open and you stay fully readable.
| User-agent | Operator | Purpose | What to do |
|---|---|---|---|
| GPTBot | OpenAI | Model training | Allow to be in future models; block to opt out of training. Does not affect ChatGPT search. |
| OAI-SearchBot | OpenAI | ChatGPT search index | Allow — this is how you appear in ChatGPT's search answers. |
| ChatGPT-User | OpenAI | Live user fetch | Allow — fetches your page when a user asks about it in ChatGPT. |
| ClaudeBot | Anthropic | Index / training | Allow to be readable and citable by Claude. |
| Claude-User | Anthropic | Live user fetch | Allow — fetches on a Claude user's behalf. |
| PerplexityBot | Perplexity | Search index | Allow — required to be cited in Perplexity answers. |
| Perplexity-User | Perplexity | Live user fetch | Allow — user-initiated fetch. |
| Google-Extended | Google | Gemini / AI training | Controls Gemini & AI training only. Does NOT affect Google Search indexing. |
| Applebot-Extended | Apple | Apple Intelligence training | Opt-out token for Apple AI training; Applebot still crawls for Siri/Spotlight. |
| CCBot | Common Crawl | Open dataset (feeds many AIs) | Allow to be in the open corpus; block to stay out of it. |
| Bytespider | ByteDance | Training (Doubao / TikTok) | Often blocked — known for aggressive crawling. |
| Amazonbot | Amazon | Alexa / search / AI | Allow for Alexa answers and Amazon's AI features. |
| Meta-ExternalAgent | Meta | AI training / fetch | Controls Meta AI's training and on-demand fetching. |
Match these tokens exactly in your robots.txt rules — a typo means the rule silently does nothing.
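One way to catch a mistyped token before it costs you is to compare every `User-agent` value in your file against the tokens in the table above. The sketch below is a minimal illustration using only Python's standard library. The helper names (`declared_agents`, `suspicious_agents`) and the 0.8 similarity cutoff are assumptions made for this example, not part of any crawler's tooling.

```python
import difflib

# Tokens copied from the table above.
KNOWN_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "Claude-User",
    "PerplexityBot", "Perplexity-User", "Google-Extended", "Applebot-Extended",
    "CCBot", "Bytespider", "Amazonbot", "Meta-ExternalAgent",
]


def declared_agents(robots_txt: str) -> list[str]:
    """Return every User-agent value declared in a robots.txt body."""
    agents = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "user-agent":
            agents.append(value.strip())
    return agents


def suspicious_agents(robots_txt: str, cutoff: float = 0.8) -> dict[str, str]:
    """Map likely-misspelled tokens to the known token they resemble.

    Hypothetical helper: the cutoff is a guess, tune it for your own files.
    """
    known_lower = {token.lower(): token for token in KNOWN_TOKENS}
    flagged = {}
    for agent in declared_agents(robots_txt):
        # Compare in lowercase; the wildcard and exact known tokens are fine.
        if agent == "*" or agent.lower() in known_lower:
            continue
        close = difflib.get_close_matches(
            agent.lower(), list(known_lower), n=1, cutoff=cutoff
        )
        if close:
            flagged[agent] = known_lower[close[0]]
    return flagged
```

Calling `suspicious_agents(open("robots.txt").read())` returns an empty dict when nothing looks off. A token that resembles none of the known ones is left alone, since it may be a legitimate bot that is simply not in the table.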
Blocking a training crawler does not remove you from that company's answers, and vice versa. The classic mistake: blocking GPTBot to "opt out of AI", then wondering why you are not in ChatGPT — when the bot behind ChatGPT's search answers is actually OAI-SearchBot, a separate token. Same with Google: Google-Extended only governs Gemini and AI training and has zero effect on normal Google Search. Decide training and visibility separately.
Allow the search and user-fetch bots so you stay citable; optionally opt out of pure training crawlers. Add this to the robots.txt at your site root:

    # Allow the AI search + user-fetch bots so you stay citable
    User-agent: OAI-SearchBot
    Allow: /

    User-agent: ChatGPT-User
    Allow: /

    User-agent: PerplexityBot
    Allow: /

    # Optional: opt out of pure model-training crawlers
    User-agent: GPTBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: CCBot
    Disallow: /

One stray Disallow is all it takes to disappear from an answer engine — verify the live file, do not assume.
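To verify rather than assume, you can feed the file to a robots.txt parser and ask it, bot by bot, whether your homepage is fetchable. Here is a minimal sketch using Python's standard `urllib.robotparser`. The function name `ai_bot_access` and the `https://example.com/` URL are placeholders for this example. Python's parser is one interpretation of the rules, so treat the result as a sanity check and not a guarantee of how each crawler behaves.

```python
from urllib.robotparser import RobotFileParser


def ai_bot_access(
    robots_txt: str,
    bots: list[str],
    url: str = "https://example.com/",  # placeholder; use a real page on your site
) -> dict[str, bool]:
    """Report, per user-agent token, whether `url` may be fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in bots}


SAMPLE = """\
# Allow the AI search bot, opt out of training
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /
"""

if __name__ == "__main__":
    report = ai_bot_access(SAMPLE, ["OAI-SearchBot", "GPTBot", "PerplexityBot"])
    for bot, allowed in report.items():
        print(f"{bot}: {'allowed' if allowed else 'BLOCKED'}")
```

To check the live file instead of a string, create the parser, call `parser.set_url("https://your-site/robots.txt")` and then `parser.read()` before calling `can_fetch`. A bot with no matching group and no `*` group is treated as allowed, which is why `PerplexityBot` passes in the sample.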
Across the 154 leading sites Oraql audited for our 2026 State of AI Search Readiness report, only 5% block a major AI crawler outright. Far more lose visibility in a subtler way: their content needs JavaScript to render, and most of these crawlers will not run it. Crawler access is necessary but not sufficient; the page also has to be readable once fetched.
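A rough way to test the "readable once fetched" half is to look at the raw HTML your server returns, before any script runs, and see whether a key sentence from the page is already in it. The sketch below does that with Python's standard `html.parser`. The names `static_text` and `readable_without_js` are made up for this example, and a substring check is only a heuristic, not how any particular crawler evaluates a page.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text a non-JavaScript client would see in raw HTML."""

    SKIP = {"script", "style", "template"}  # contents are not visible text

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def static_text(html: str) -> str:
    """Return the visible text present in the HTML without running scripts."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.chunks)


def readable_without_js(html: str, must_contain: str) -> bool:
    """True if `must_contain` appears in the static text (case-insensitive)."""
    return must_contain.lower() in static_text(html).lower()
```

Feed it the response body from a plain HTTP fetch of your page (for example the output of `curl`) and a sentence you expect AI answers to quote. If the check fails on the raw HTML but the sentence is visible in your browser, the content is probably being injected by JavaScript.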
Not sure which bots your robots.txt allows? Our free robots.txt AI-crawler checker tells you instantly. Then run the full AI Search Readiness audit — a 0-100 score, an A-F grade, and a prioritized fix list across the seven signals that decide whether AI can read and recommend you.