AI Crawlers Are Not Search Engines
Traditional search engine crawlers (Googlebot, Bingbot) index your pages for search results. AI crawlers serve a different purpose: they fetch content to train models, power AI search (Perplexity, SearchGPT), or enable AI assistants to answer questions about your site.
The key difference: blocking Googlebot removes you from search results. Blocking AI crawlers removes you from AI-powered answers and recommendations — an increasingly important channel.
Known AI Bot User-Agents
| User-Agent | Company | Purpose |
|---|---|---|
| GPTBot | OpenAI | ChatGPT web browsing and training |
| ChatGPT-User | OpenAI | Real-time browsing by ChatGPT users |
| ClaudeBot | Anthropic | Claude web search and training |
| anthropic-ai | Anthropic | Anthropic model training |
| PerplexityBot | Perplexity | Perplexity AI search results |
| Bytespider | ByteDance | TikTok / Doubao AI training |
| Google-Extended | Google | Gemini AI training (separate from Googlebot) |
| cohere-ai | Cohere | Cohere model training |
| Applebot-Extended | Apple | Apple Intelligence training |
Configuration Examples
Allow all AI bots (recommended)
```
# Allow AI crawlers
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /
```
Allow with restrictions
```
# Allow AI bots but block admin and private pages
User-agent: GPTBot
Allow: /
Disallow: /admin/
Disallow: /account/
Disallow: /checkout/

User-agent: ClaudeBot
Allow: /
Disallow: /admin/
Disallow: /account/
```
Block training but allow browsing
```
# Allow real-time AI search, block training crawlers
User-agent: ChatGPT-User
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```
Common Mistakes
Blanket wildcard blocks
A `User-agent: *` group with `Disallow: /` blocks every crawler, including AI bots. If you use this pattern, you must also add explicit per-bot groups that allow the AI User-Agents you want; crawlers apply the most specific matching User-agent group, so a named group overrides the wildcard regardless of where it appears in the file.
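A minimal sketch of this fix, keeping the wildcard block while carving out named AI bots (the bot list here is illustrative, not exhaustive):

```
# Explicit groups override the wildcard for these bots
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

# Everyone else stays blocked
User-agent: *
Disallow: /
```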
Security plugin overrides
WordPress security plugins (Wordfence, Sucuri) can add User-Agent blocks that override your robots.txt. Check your plugin settings separately.
Forgetting ChatGPT-User
GPTBot is for training; ChatGPT-User is for real-time browsing. Blocking GPTBot doesn't block ChatGPT's live browsing, and vice versa.
No Sitemap directive
Always include a Sitemap directive at the bottom of robots.txt. AI crawlers use it to discover your pages efficiently.
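The directive takes a single absolute URL per line; `your-site.com` and the sitemap path below are placeholders for your own values:

```
Sitemap: https://your-site.com/sitemap.xml
```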
robots.txt cached by CDN
If you update robots.txt but your CDN serves a cached version, crawlers won't see the change. Purge your CDN cache after updates.
Test Your Configuration
w2agent checks your robots.txt against all known AI User-Agents and reports exactly which bots are allowed, blocked, or partially restricted. It also generates optimized robots.txt rules as part of its output.
Testing Your robots.txt Manually
Before running a full audit, you can test your robots.txt against specific User-Agents with curl. This simulates exactly what a bot sees when it checks your rules:
Fetch robots.txt and check it yourself
```shell
# Fetch your robots.txt
curl -s https://your-site.com/robots.txt

# Check whether a specific User-Agent is blocked at the HTTP level
# (robots.txt parsers follow specific precedence rules — test with a tool)
curl -A "GPTBot" -s -o /dev/null -w "%{http_code}" https://your-site.com/

# 200 — accessible. 403/401/503 — blocked at server level.
```

Note that `curl -A` tests whether your server blocks the User-Agent at the HTTP level (WAF, CDN, security plugins). robots.txt rules are evaluated separately by the bot's own parser, so a 200 on the URL doesn't mean robots.txt allows it. Both layers matter; see AI Crawler Blocking for the full picture.
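For the robots.txt layer itself, a quick local check is possible with Python's standard-library parser. This is a minimal sketch using a hypothetical set of rules; note that `urllib.robotparser` applies rules in file order within a group, while Google documents longest-match precedence, so results can differ on overlapping Allow/Disallow patterns:

```python
from urllib import robotparser

# Hypothetical robots.txt content for illustration
rules = """\
User-agent: GPTBot
Disallow: /admin/
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# /admin/ paths are disallowed for GPTBot
print(rp.can_fetch("GPTBot", "https://your-site.com/admin/settings"))

# Everything else falls through to Allow: /
print(rp.can_fetch("GPTBot", "https://your-site.com/blog/post"))
```

Swap in your live file by fetching it first (for example with `curl` as above) and passing its lines to `parse()`.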
New and Updated AI Bots (2024–2025)
The AI crawler landscape changes rapidly. Several new User-Agents appeared or expanded in 2024–2025 that aren't in most robots.txt guides:
OAI-SearchBot
OpenAI's newer search indexing bot, separate from GPTBot. It appeared in mid-2024, so add explicit rules for it even if you have already configured GPTBot.
Meta-ExternalAgent
Meta's crawler for AI features in Facebook, Instagram, and WhatsApp. Uses a distinct User-Agent from Meta-ExternalFetcher.
YouBot
You.com's AI search crawler. Respects robots.txt but uses a low-profile User-Agent that's easy to miss in access logs.
Amazonbot
Amazon's crawler powers Alexa AI and Amazon Q. Growing in importance as Amazon invests in AI-powered search.
Check your w2agent score — the Bot Accessibility category tests against all known AI User-Agents, including these newer ones. If you have llms.txt deployed, make sure none of these bots are blocked from reading it.
Related Articles
- Why AI Crawlers Get Blocked — robots.txt is one cause; learn the others that affect AI access.
- What is llms.txt? — The file AI crawlers fetch once your robots.txt lets them through.
- AI Readiness Audit — See how Bot Accessibility fits into your overall AI readiness score.
Score your site now
Get your free w2agent score and generate the files your site needs.
Get Your Score