AI Crawlers Are Not Search Engines
Traditional search engine crawlers (Googlebot, Bingbot) index your pages for search results. AI crawlers serve a different purpose: they fetch content to train models, power AI search (Perplexity, SearchGPT), or enable AI assistants to answer questions about your site.
The key difference: blocking Googlebot removes you from search results. Blocking AI crawlers removes you from AI-powered answers and recommendations — an increasingly important channel.
Known AI Bot User-Agents
| User-Agent | Company | Purpose |
|---|---|---|
| GPTBot | OpenAI | ChatGPT web browsing and training |
| ChatGPT-User | OpenAI | Real-time browsing by ChatGPT users |
| ClaudeBot | Anthropic | Claude web search and training |
| anthropic-ai | Anthropic | Anthropic model training |
| PerplexityBot | Perplexity | Perplexity AI search results |
| Bytespider | ByteDance | TikTok / Doubao AI training |
| Google-Extended | Google | Gemini AI training (separate from Googlebot) |
| cohere-ai | Cohere | Cohere model training |
| Applebot-Extended | Apple | Apple Intelligence training |
Configuration Examples
Allow all AI bots (recommended)
```
# Allow AI crawlers
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /
```
Allow with restrictions
```
# Allow AI bots but block admin and private pages
User-agent: GPTBot
Allow: /
Disallow: /admin/
Disallow: /account/
Disallow: /checkout/

User-agent: ClaudeBot
Allow: /
Disallow: /admin/
Disallow: /account/
```
Block training but allow browsing
```
# Allow real-time AI search, block training crawlers
User-agent: ChatGPT-User
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```
Common Mistakes
Blanket wildcard blocks
A `User-agent: *` group with `Disallow: /` blocks every crawler, including AI bots. If you use this pattern, you must also add explicit per-bot groups that allow the AI User-Agents you want; crawlers apply the most specific matching User-agent group, so a named group overrides the wildcard regardless of where it appears in the file.
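A minimal sketch of this fix, keeping the wildcard block while carving out named AI bots (the bot list here is illustrative, not exhaustive):

```
# Explicit groups override the wildcard for these bots
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

# Everyone else stays blocked
User-agent: *
Disallow: /
```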
Security plugin overrides
WordPress security plugins (Wordfence, Sucuri) can add User-Agent blocks that override your robots.txt. Check your plugin settings separately.
Forgetting ChatGPT-User
GPTBot is for training; ChatGPT-User is for real-time browsing. Blocking GPTBot doesn't block ChatGPT's live browsing, and vice versa.
No Sitemap directive
Always include a Sitemap directive at the bottom of robots.txt. AI crawlers use it to discover your pages efficiently.
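The directive takes a single absolute URL per line; `your-site.com` and the sitemap path below are placeholders for your own values:

```
Sitemap: https://your-site.com/sitemap.xml
```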
robots.txt cached by CDN
If you update robots.txt but your CDN serves a cached version, crawlers won't see the change. Purge your CDN cache after updates.
Test Your Configuration
w2agent checks your robots.txt against all known AI User-Agents and reports exactly which bots are allowed, blocked, or partially restricted. It also generates optimized robots.txt rules as part of its output.
Testing Your robots.txt Manually
Before running a full audit, you can test your robots.txt against specific User-Agents with curl. This simulates exactly what a bot sees when it checks your rules:
Fetch robots.txt and check it yourself
```shell
# Fetch your robots.txt
curl -s https://your-site.com/robots.txt

# Check whether a specific User-Agent is blocked at the HTTP level
# (robots.txt parsers follow specific precedence rules — test with a tool)
curl -A "GPTBot" -s -o /dev/null -w "%{http_code}" https://your-site.com/

# 200 — accessible. 403/401/503 — blocked at server level.
```

Note that `curl -A` tests whether your server blocks the User-Agent at the HTTP level (WAF, CDN, security plugins). robots.txt rules are evaluated separately by the bot's own parser, so a 200 on the URL doesn't mean robots.txt allows it. Both layers matter; see AI Crawler Blocking for the full picture.
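For the robots.txt layer itself, a quick local check is possible with Python's standard-library parser. This is a minimal sketch using a hypothetical set of rules; note that `urllib.robotparser` applies rules in file order within a group, while Google documents longest-match precedence, so results can differ on overlapping Allow/Disallow patterns:

```python
from urllib import robotparser

# Hypothetical robots.txt content for illustration
rules = """\
User-agent: GPTBot
Disallow: /admin/
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# /admin/ paths are disallowed for GPTBot
print(rp.can_fetch("GPTBot", "https://your-site.com/admin/settings"))

# Everything else falls through to Allow: /
print(rp.can_fetch("GPTBot", "https://your-site.com/blog/post"))
```

Swap in your live file by fetching it first (for example with `curl` as above) and passing its lines to `parse()`.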
New and Updated AI Bots (2024–2025)
The AI crawler landscape changes rapidly. Several new User-Agents appeared or expanded in 2024–2025 that aren't in most robots.txt guides:
OAI-SearchBot
OpenAI's newer search indexing bot, separate from GPTBot. It appeared in mid-2024, so add explicit rules for it even if you have already configured GPTBot.
Meta-ExternalAgent
Meta's crawler for AI features in Facebook, Instagram, and WhatsApp. Uses a distinct User-Agent from Meta-ExternalFetcher.
YouBot
You.com's AI search crawler. Respects robots.txt but uses a low-profile User-Agent that's easy to miss in access logs.
Amazonbot
Amazon's crawler powers Alexa AI and Amazon Q. Growing in importance as Amazon invests in AI-powered search.
Check your w2agent score — the Bot Accessibility category tests against all known AI User-Agents, including these newer ones. If you have llms.txt deployed, make sure none of these bots are blocked from reading it.
Related Articles
- Why AI Crawlers Get Blocked — robots.txt is one cause; learn the others that affect AI access.
- What is llms.txt? — The file AI crawlers fetch once your robots.txt lets them through.
- AI Readiness Audit — See how Bot Accessibility fits into your overall AI readiness score.
Score your site now
Get your free w2agent score and generate the files your site needs.
Get Your Score