The Complete 2026 AI Crawler Stack: GPTBot, ClaudeBot, PerplexityBot, and Every AI Bot You Need to Configure
Your website is being crawled by 15+ AI bots right now, and most teams have no idea which, why, or how to control them. This technical guide is the definitive 2026 reference for every major AI crawler: GPTBot, OAI-SearchBot, ClaudeBot, PerplexityBot, Google-Extended, Meta-ExternalAgent, Applebot-Extended, Bytespider, CCBot, and more. It covers robots.txt syntax, llms.txt standards, IndexNow integration, CDN-level controls, and a strategic framework for deciding which bots to allow and which to block.
Why AI Crawler Configuration Is Now a Core SEO Discipline
In 2026, your website is being crawled by more than 15 distinct AI bots — most teams cannot name them, do not know which to allow, and have no monitoring on which bots are actually consuming their content. This is a strategic gap. AI crawlers are the pipes through which your brand becomes (or fails to become) part of how ChatGPT, Claude, Perplexity, Gemini, Meta AI, and Apple Intelligence answer questions about your category.
Configuring these crawlers is no longer optional. It is now a core technical SEO discipline that sits alongside traditional Google crawler management. Get it right and your brand becomes legible, citeable, and recommendable across every AI assistant your buyers use. Get it wrong and you either leak content to bots you never wanted to serve, or block the very crawlers that would have made your brand visible in AI answers.
This guide is the definitive 2026 reference. It covers every major AI crawler by name and provider, exact robots.txt syntax, the llms.txt emerging standard, IndexNow integration for real-time AI freshness, CDN-level controls, server-side logging strategies, and the strategic framework for deciding what to allow versus block.
Training Crawlers vs Search Crawlers: The Distinction That Changes Everything
The single most important concept in AI crawler management is the distinction between two crawler types. Most websites accidentally block one when they meant to block the other — and the consequences are very different.
Training Crawlers
Training crawlers visit your site to collect content that gets included in the next generation of an AI model's training data. They are not used to answer live user queries today. Blocking them protects your content from being baked into future models — which matters if you publish proprietary or paid content. Examples: GPTBot, Google-Extended, ClaudeBot (training variant), Applebot-Extended.
Search and Citation Crawlers
Search crawlers visit your site in real time when an AI assistant needs to look up current information to answer a user query. These are the crawlers that put your brand into live AI answers with citations. Blocking them is how you become invisible in AI search results today. Examples: OAI-SearchBot, ChatGPT-User, PerplexityBot, Perplexity-User, GoogleOther, claude-web.
The strategic implication is critical: you can block training crawlers while allowing search crawlers — letting AI assistants cite you live without contributing your content to training data. Or you can allow both. Or block both. But you must make the decision consciously, per crawler, not with one blanket rule.
The Complete 2026 AI Crawler Stack
Here is every major AI crawler you need to know about in 2026, organized by provider, with crawler type, purpose, and configuration directives.
OpenAI Crawlers
OpenAI operates three distinct crawlers, each with a different purpose. Understanding the difference is essential.
- GPTBot: Training crawler. Collects content used to train future GPT models. Documented at openai.com/gptbot. Respects robots.txt directives. User-agent: GPTBot
- OAI-SearchBot: Search crawler. Used by ChatGPT's browsing feature and SearchGPT to fetch real-time content for live citations. Allowing this puts your brand into ChatGPT answers today. User-agent: OAI-SearchBot
- ChatGPT-User: User-triggered crawler. Activated when a ChatGPT user explicitly asks the assistant to visit your URL. Blocking this prevents users from using ChatGPT to read your content. User-agent: ChatGPT-User
Anthropic Crawlers
Anthropic operates several crawlers for Claude. Naming has evolved — the current canonical list:
- ClaudeBot: Primary training and indexing crawler for Claude. User-agent: ClaudeBot
- anthropic-ai: Legacy crawler name still used by some Anthropic infrastructure. User-agent: anthropic-ai
- claude-web: Real-time search crawler invoked when Claude needs live web data to answer a user query. User-agent: claude-web
Perplexity Crawlers
Perplexity is one of the most aggressive AI search engines in citing live web sources. Its crawlers:
- PerplexityBot: Indexing crawler. Collects content for Perplexity's search index. User-agent: PerplexityBot
- Perplexity-User: Real-time user-triggered crawler invoked when a Perplexity user submits a query that needs fresh web content. User-agent: Perplexity-User
Google AI Crawlers
Google maintains the most carefully separated bot architecture in the industry, designed to let publishers control AI-specific access without affecting traditional Google Search visibility.
- Google-Extended: Controls whether your content is used to train Google's Gemini models and improve generative AI products. Does NOT affect Google Search indexing. User-agent: Google-Extended
- GoogleOther: Used by Google research teams and various internal AI-related fetches. User-agent: GoogleOther
- Googlebot: The traditional Google Search crawler. Not an AI crawler, but listed here for clarity because publishers often confuse it with AI bots. User-agent: Googlebot
Meta AI Crawlers
Meta's AI crawlers serve Meta AI (the assistant inside Instagram, WhatsApp, Messenger, and meta.ai) plus the Llama family training pipeline.
- Meta-ExternalAgent: Primary AI crawler for Meta AI products and Llama training. User-agent: Meta-ExternalAgent
- Meta-ExternalFetcher: Real-time fetcher for Meta AI user queries. User-agent: Meta-ExternalFetcher
- FacebookBot: Legacy social-card crawler. Not strictly an AI crawler, but Meta has begun routing some AI fetches through it. User-agent: FacebookBot
Apple Intelligence Crawlers
Apple introduced AI-specific crawler controls in 2024 to support Apple Intelligence training while keeping its traditional search crawler unaffected.
- Applebot: Traditional Apple search crawler used for Siri, Spotlight, and Safari suggestions. Not AI-specific. User-agent: Applebot
- Applebot-Extended: Controls whether your content is used to train Apple's foundation models. User-agent: Applebot-Extended
Other Major AI Crawlers
- Amazonbot: Used by Amazon Alexa, Q, and other Amazon AI products. User-agent: Amazonbot
- Bytespider: ByteDance's crawler powering Doubao and TikTok's AI features. Notably aggressive crawler; many sites choose to block it. User-agent: Bytespider
- CCBot: Common Crawl. Not owned by an AI company, but Common Crawl data is included in the training corpora of nearly every major LLM. Blocking CCBot effectively blocks downstream training across many models. User-agent: CCBot
- cohere-ai: Cohere's training crawler. User-agent: cohere-ai
- Diffbot: Specialized knowledge-graph crawler used by some enterprise AI customers. User-agent: Diffbot
- MistralAI-User: Mistral's real-time fetcher for Le Chat and similar products. User-agent: MistralAI-User
- Bingbot: Microsoft Bing's crawler, which feeds Microsoft Copilot's web grounding. Traditionally a search crawler, increasingly important for AI visibility because Copilot uses Bing as its retrieval layer. User-agent: Bingbot
The Complete robots.txt Configuration for 2026
Here is a recommended starting robots.txt configuration for a typical brand that wants maximum AI visibility while retaining strategic control. Adjust per your business model.
The pattern below: allow all search/citation crawlers (these put you in live AI answers), allow training crawlers from major Western providers (your content becomes part of model knowledge), and block aggressive or low-value crawlers (Bytespider, Diffbot).
Recommended robots.txt Template
This template is structured as named blocks for clarity. In your actual robots.txt file, the syntax follows the standard User-agent + Allow / Disallow + Crawl-delay + Sitemap pattern.
- Allow OpenAI crawlers: User-agent: GPTBot, Allow: /. User-agent: OAI-SearchBot, Allow: /. User-agent: ChatGPT-User, Allow: /.
- Allow Anthropic crawlers: User-agent: ClaudeBot, Allow: /. User-agent: anthropic-ai, Allow: /. User-agent: claude-web, Allow: /.
- Allow Perplexity crawlers: User-agent: PerplexityBot, Allow: /. User-agent: Perplexity-User, Allow: /.
- Allow Google AI crawlers: User-agent: Google-Extended, Allow: /. User-agent: GoogleOther, Allow: /.
- Allow Meta crawlers: User-agent: Meta-ExternalAgent, Allow: /. User-agent: Meta-ExternalFetcher, Allow: /.
- Allow Apple Intelligence: User-agent: Applebot, Allow: /. User-agent: Applebot-Extended, Allow: /.
- Allow Amazon: User-agent: Amazonbot, Allow: /.
- Allow Common Crawl: User-agent: CCBot, Allow: /.
- Allow Cohere: User-agent: cohere-ai, Allow: /.
- Allow Mistral: User-agent: MistralAI-User, Allow: /.
- Block Bytespider (optional): User-agent: Bytespider, Disallow: /. (Reason: aggressive crawler with limited Western AI footprint)
- Block Diffbot (optional): User-agent: Diffbot, Disallow: /. (Reason: serves enterprise data extraction, not direct AI search visibility)
- Sitemap: Sitemap: https://yourdomain.com/sitemap.xml
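To make the named blocks above concrete, here is a minimal Python sketch that renders them as an actual robots.txt body and sanity-checks the result with the standard-library parser. The helper name build_robots_txt and the example URLs are illustrative assumptions, not part of any standard; the crawler lists simply mirror the template.

```python
from urllib.robotparser import RobotFileParser

# Crawlers the template allows, and the two it optionally blocks.
ALLOWED = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "anthropic-ai", "claude-web",
    "PerplexityBot", "Perplexity-User",
    "Google-Extended", "GoogleOther",
    "Meta-ExternalAgent", "Meta-ExternalFetcher",
    "Applebot", "Applebot-Extended",
    "Amazonbot", "CCBot", "cohere-ai", "MistralAI-User",
]
BLOCKED = ["Bytespider", "Diffbot"]
SITEMAP_URL = "https://yourdomain.com/sitemap.xml"


def build_robots_txt(allowed, blocked, sitemap_url):
    """Render one User-agent group per crawler, then the Sitemap line."""
    groups = [f"User-agent: {bot}\nAllow: /" for bot in allowed]
    groups += [f"User-agent: {bot}\nDisallow: /" for bot in blocked]
    return "\n\n".join(groups) + f"\n\nSitemap: {sitemap_url}\n"


robots_txt = build_robots_txt(ALLOWED, BLOCKED, SITEMAP_URL)

# Parse the rendered text the way a well-behaved crawler would.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
print(parser.can_fetch("GPTBot", "https://yourdomain.com/pricing"))      # True
print(parser.can_fetch("Bytespider", "https://yourdomain.com/pricing"))  # False
```

Printing robots_txt gives the file contents to deploy at the domain root. Generating the file from two lists keeps the allow and block decisions reviewable in one place.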
Granular robots.txt Strategy: Allow Search, Block Training
If you want your brand to be cited in live AI answers but do NOT want your content used to train future foundation models, configure crawlers asymmetrically. Allow search crawlers (OAI-SearchBot, ChatGPT-User, Perplexity-User, claude-web, Meta-ExternalFetcher) while disallowing training crawlers (GPTBot, Google-Extended, Applebot-Extended, ClaudeBot, CCBot). This is the right strategy for paywalled publishers, premium content sites, and brands with sensitive proprietary information.
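A sketch of that asymmetric policy, assuming the crawler names listed in the paragraph above: the robots.txt text is held in a Python string so it can be checked with the standard-library parser before deployment. The access_report helper is a hypothetical name used only for this illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# Search and user-triggered fetchers: allowed
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: Perplexity-User
User-agent: claude-web
User-agent: Meta-ExternalFetcher
Allow: /

# Training crawlers: blocked
User-agent: GPTBot
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: ClaudeBot
User-agent: CCBot
Disallow: /
"""


def access_report(robots_txt, bots, url="https://yourdomain.com/"):
    """Return {bot: allowed?} for the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in bots}


report = access_report(ROBOTS_TXT, ["OAI-SearchBot", "ChatGPT-User", "GPTBot", "CCBot"])
print(report)
# {'OAI-SearchBot': True, 'ChatGPT-User': True, 'GPTBot': False, 'CCBot': False}
```

Several User-agent lines can share one rule group, which keeps the two halves of the policy visually separate.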
Crawl-Delay Considerations
Some AI crawlers are aggressive. If your servers are seeing performance impact from AI crawling, you can set a Crawl-delay directive. Bytespider and PerplexityBot in particular have been documented as high-volume crawlers in 2025. A 10-second crawl delay (Crawl-delay: 10) is a reasonable starting point if you experience load issues. Note that Crawl-delay is a nonstandard directive and not all crawlers honor it (Googlebot, for example, ignores it), so for crawlers that disregard it, enforce rate limits at the server or CDN instead.
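As a small illustration of the directive, the sketch below applies a 10-second delay to one crawler and reads it back with Python's standard-library parser. Scoping the delay to Bytespider is an example choice, not a recommendation for every site.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Bytespider
Crawl-delay: 10
Allow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The delay applies only to the group that declares it.
print(parser.crawl_delay("Bytespider"))  # 10
print(parser.crawl_delay("GPTBot"))      # None
```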
The llms.txt Standard: A Quick Reference
llms.txt is an emerging web standard proposed in 2024 and rapidly gaining adoption in 2025–2026. It is a single Markdown file placed at the root of your domain (/llms.txt) that gives AI models a structured, AI-optimized summary of your site. Think of it as the AI equivalent of robots.txt or sitemap.xml — but instead of telling crawlers what they can access, it tells them what your site is about.
What llms.txt Contains
A well-structured llms.txt file includes:
- Your brand name and a concise description of what you do
- Your value proposition and category positioning
- Links to your most important pages (product, pricing, documentation, FAQ)
- Key facts AI models should know about your brand
- Links to deeper resources (case studies, comparison pages, technical docs)
- Optionally: a second file at /llms-full.txt containing expanded content for richer AI consumption
Why llms.txt Matters
AI models do not crawl your entire site every time they need to answer a query. They synthesize cached and indexed content. A well-crafted llms.txt file gives AI models a curated, brand-authored summary of who you are — reducing the chance of misrepresentation and increasing the chance of accurate citation. Major brands including Anthropic, Vercel, and Stripe now publish llms.txt files at their domain roots.
Implementing llms.txt
Create a Markdown file at https://yourdomain.com/llms.txt. Keep it focused and accurate. Update it whenever your positioning, pricing, or core offering changes. Use H1 for your brand name, a single paragraph description, then a list of links with one-line annotations. Aim for under 500 words in llms.txt and under 5,000 words in optional llms-full.txt.
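The layout described above (H1, one-paragraph description, annotated links) can be sketched as a small generator. The brand "Example Co", its URLs, and the build_llms_txt helper are all hypothetical placeholders; only the overall shape follows the description in this section.

```python
def build_llms_txt(brand, description, links):
    """Render an llms.txt body: H1, one paragraph, then annotated links."""
    lines = [f"# {brand}", "", description, ""]
    for title, url, note in links:
        lines.append(f"- [{title}]({url}): {note}")
    return "\n".join(lines) + "\n"


llms_txt = build_llms_txt(
    brand="Example Co",  # hypothetical brand
    description="Example Co makes invoicing software for small agencies.",
    links=[
        ("Product", "https://example.com/product", "What the product does"),
        ("Pricing", "https://example.com/pricing", "Current plans and prices"),
        ("Docs", "https://example.com/docs", "Technical documentation"),
        ("FAQ", "https://example.com/faq", "Common questions"),
    ],
)

# Keep the file within the suggested 500-word budget.
word_count = len(llms_txt.split())
print(llms_txt)
print(f"{word_count} words")
```

Generating the file from the same source as your pricing and product pages is one way to keep it from drifting out of date.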
IndexNow: Real-Time AI Content Freshness
IndexNow is an open protocol originally backed by Microsoft and Yandex that lets you push notifications to search engines the instant your content changes. While not exclusively an AI protocol, IndexNow has become an important AEO tool because AI search engines benefit dramatically from fresh content — and IndexNow is the fastest way to tell them about your updates.
Which Engines Support IndexNow
Bing supports IndexNow natively, which means Microsoft Copilot's web grounding sees your updates instantly. Yandex supports it. Multiple smaller AI search engines have begun adopting it. Notable absences: Google, OpenAI, and Perplexity do not currently consume IndexNow signals directly — but the trend is toward broader adoption.
Implementing IndexNow
Generate an API key (a random string), host it as a text file at https://yourdomain.com/[your-key].txt, then send HTTP POST notifications to https://api.indexnow.org/IndexNow with the URL of any page that has been updated. Most modern CMS platforms (WordPress, Webflow, Shopify) have IndexNow plugins. For Next.js / static-site setups, integrate IndexNow into your deployment pipeline to ping the API on every content publish.
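A minimal sketch of the notification step using only the Python standard library. The JSON field names (host, key, keyLocation, urlList) follow the IndexNow protocol; the host, key value, and helper names are placeholders, and the endpoint is the one given above. The request is built separately from sending so it can be inspected or logged first.

```python
import json
import urllib.request

INDEXNOW_ENDPOINT = "https://api.indexnow.org/IndexNow"


def build_indexnow_request(host, key, urls):
    """Build (but do not send) an IndexNow POST for the updated URLs."""
    payload = {
        "host": host,
        "key": key,
        # The key file hosted at the domain root, as described above.
        "keyLocation": f"https://{host}/{key}.txt",
        "urlList": list(urls),
    }
    return urllib.request.Request(
        INDEXNOW_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


def submit(request, timeout=10):
    """Send the notification and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Placeholder host and key, for illustration only.
request = build_indexnow_request(
    host="yourdomain.com",
    key="0123456789abcdef0123456789abcdef",
    urls=["https://yourdomain.com/blog/new-post"],
)
print(request.get_method(), request.full_url)
```

In a deployment pipeline, calling submit(request) after each publish step would send the ping; wrapping it in error handling is advisable so a failed notification never blocks a deploy.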
Detecting and Logging AI Crawler Activity
If you do not measure AI crawler activity, you cannot manage it. Most teams have no idea how often GPTBot, ClaudeBot, or PerplexityBot are visiting their site — let alone which pages those bots care about. Build the logging discipline before you build the optimization strategy.
Server Log Analysis
Every AI bot identifies itself in the HTTP User-Agent header. Parse your access logs and group requests by User-Agent. Tools like GoAccess, AWStats, or custom log shippers to BigQuery, Snowflake, or ClickHouse can produce a daily report of AI bot activity. Watch for sudden surges in PerplexityBot or Bytespider — these often indicate your content is being heavily harvested for a specific query category.
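As a starting point before reaching for a log pipeline, the grouping step can be sketched in a few lines of Python. This assumes access logs in the common "combined" format; the sample lines and their User-Agent strings are made up for illustration and are not exact copies of what any provider sends.

```python
import re
from collections import Counter

AI_BOTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
    "PerplexityBot", "Perplexity-User", "Bytespider", "CCBot",
]

# Combined Log Format: the User-Agent is the last quoted field.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"$'
)


def identify_bot(user_agent):
    """Return the first known AI bot token found in the User-Agent, if any."""
    lowered = user_agent.lower()
    for bot in AI_BOTS:
        if bot.lower() in lowered:
            return bot
    return None


def count_bot_hits(lines):
    """Count requests per (bot, path), skipping unparseable and non-bot lines."""
    hits = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if match is None:
            continue
        bot = identify_bot(match["agent"])
        if bot is not None:
            hits[(bot, match["path"])] += 1
    return hits


SAMPLE_LOG = [  # illustrative lines only
    '203.0.113.7 - - [12/Jan/2026:10:15:32 +0000] "GET /pricing HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.7 - - [12/Jan/2026:10:15:40 +0000] "GET /pricing HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.9 - - [12/Jan/2026:10:16:01 +0000] "GET /blog HTTP/1.1" 200 9001 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '198.51.100.4 - - [12/Jan/2026:10:16:05 +0000] "GET / HTTP/1.1" 200 1200 "-" "Mozilla/5.0 (Windows NT 10.0) Firefox/130.0"',
]

for (bot, path), count in count_bot_hits(SAMPLE_LOG).most_common():
    print(bot, path, count)
```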
IP Range Verification
User-Agent strings can be spoofed. For high-trust use cases, verify that requests claiming to be from GPTBot or PerplexityBot actually originate from the published IP ranges of those providers. OpenAI publishes its crawler IP ranges. Anthropic publishes ClaudeBot ranges. Reverse-DNS verification (a PTR lookup, confirmed by a matching forward lookup) is the gold standard for confirming bot identity.
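Both checks can be sketched with the Python standard library. The CIDR blocks below are reserved documentation ranges used as placeholders, NOT real crawler ranges; in practice you would load the ranges each provider publishes. The allowed hostname suffixes passed to the DNS check are likewise something you would take from the provider's documentation.

```python
import ipaddress
import socket

# Placeholder ranges (RFC 5737 documentation blocks), not real crawler IPs.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "ClaudeBot": ["198.51.100.0/24"],
}


def ip_matches_claimed_bot(ip, bot, ranges=PUBLISHED_RANGES):
    """True if the client IP falls inside a range listed for the claimed bot."""
    address = ipaddress.ip_address(ip)
    return any(
        address in ipaddress.ip_network(cidr) for cidr in ranges.get(bot, [])
    )


def forward_confirmed_rdns(ip, allowed_suffixes):
    """PTR lookup, suffix check, then a forward lookup that must return the IP."""
    try:
        hostname = socket.gethostbyaddr(ip)[0]
    except OSError:
        return False
    if not hostname.endswith(tuple(allowed_suffixes)):
        return False
    try:
        forward_ips = {info[4][0] for info in socket.getaddrinfo(hostname, None)}
    except OSError:
        return False
    return ip in forward_ips


print(ip_matches_claimed_bot("192.0.2.44", "GPTBot"))   # True
print(ip_matches_claimed_bot("203.0.113.5", "GPTBot"))  # False
```

The range check is cheap enough to run on every request; the DNS check needs network lookups, so it is better run asynchronously or cached per IP.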
What to Measure
- Bot visit frequency: Requests per day, per bot, per page
- Top crawled pages: Which pages AI bots are most interested in (these are your highest-AI-value pages)
- HTTP response distribution: Are bots getting 200s, 404s, or 503s? Errors suppress AI visibility
- Crawl recency: Last-seen timestamp per bot per page — stale crawls mean stale AI representations
- Geographic patterns: Some AI bots crawl from specific regions, so geo-restricted content needs CDN rules that account for where those bots originate
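Two of these measurements, response distribution and crawl recency, can be sketched as a single pass over already-parsed log records. The record shape (bot, path, status, timestamp) and the sample values are assumptions for illustration; the timestamp format matches common access-log output.

```python
from collections import Counter, defaultdict
from datetime import datetime

TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"  # e.g. 12/Jan/2026:10:15:32 +0000


def summarize(records):
    """Return (status counts per bot, last-seen time per (bot, path))."""
    status_by_bot = defaultdict(Counter)
    last_seen = {}
    for bot, path, status, timestamp in records:
        status_by_bot[bot][status] += 1
        seen = datetime.strptime(timestamp, TIME_FORMAT)
        key = (bot, path)
        if key not in last_seen or seen > last_seen[key]:
            last_seen[key] = seen
    return status_by_bot, last_seen


RECORDS = [  # illustrative data only
    ("GPTBot", "/pricing", 200, "12/Jan/2026:10:15:32 +0000"),
    ("GPTBot", "/pricing", 200, "14/Jan/2026:08:00:00 +0000"),
    ("PerplexityBot", "/old-page", 404, "13/Jan/2026:09:30:00 +0000"),
]

status_by_bot, last_seen = summarize(RECORDS)
for bot, counts in status_by_bot.items():
    print(bot, dict(counts))
for (bot, path), seen in last_seen.items():
    print(bot, path, seen.isoformat())
```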
To Allow or Block — The Strategic Framework
The biggest mistake teams make is treating AI crawler decisions as binary "allow all" or "block all." The right answer is conditional on your business model. Use this decision framework.
Allow Everything (Default for Most B2B SaaS, DTC, Content Sites)
If your goal is maximum AI visibility, allow every legitimate AI crawler. Your content becomes part of how AI assistants understand your brand. The reputational risk is minimal because reputable AI companies (OpenAI, Anthropic, Google, Apple) operate with documented policies.
Allow Search, Block Training (Premium Content, Paywalled Publishers)
If you sell content access (news, research, premium analysis), allow real-time search crawlers so you appear in AI citations (which drive referral traffic), but block training crawlers so your archive is not folded into model knowledge. New York Times, The Atlantic, and major academic publishers have adopted this stance.
Block Everything Except Verified Search (High-Sensitivity Brands)
Financial services, healthcare providers, and brands with strict regulatory or legal exposure may want to block all training crawlers and most search crawlers, allowing only narrowly scoped real-time fetchers. This is a defensive posture that trades AI visibility for compliance.
Selective Per-Section Strategy
The most sophisticated approach: allow all crawlers on marketing and content pages, but block all crawlers on customer-only resources, internal documentation, or paid content using path-based directives (Disallow: /customers/, Disallow: /docs/internal/). This is the right strategy for B2B SaaS with mixed public/private content.
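One detail matters when combining path rules with per-bot groups: a crawler that finds a group naming it follows that group and does not also apply the "User-agent: *" group, so the private paths need to be repeated in every group. A minimal sketch, assuming the two example paths above and a hypothetical render_group helper:

```python
from urllib.robotparser import RobotFileParser

PRIVATE_PATHS = ["/customers/", "/docs/internal/"]
BOTS = ["GPTBot", "ClaudeBot", "*"]


def render_group(bot, private_paths):
    """One group: block the private paths first, then allow everything else."""
    rules = [f"Disallow: {path}" for path in private_paths]
    return "\n".join([f"User-agent: {bot}", *rules, "Allow: /"])


robots_txt = "\n\n".join(render_group(bot, PRIVATE_PATHS) for bot in BOTS) + "\n"

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

base = "https://yourdomain.com"
print(parser.can_fetch("GPTBot", base + "/blog/launch"))           # True
print(parser.can_fetch("GPTBot", base + "/customers/acme"))        # False
print(parser.can_fetch("PerplexityBot", base + "/docs/internal/")) # False
```

Keep in mind that robots.txt only asks crawlers to stay out; genuinely private resources still need authentication.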
CDN-Level AI Bot Controls
robots.txt is a polite request — it relies on bots to honor your directives. For enforcement, use CDN-level controls. The major CDN providers now offer dedicated AI bot management.
Cloudflare AI Bot Management
Cloudflare introduced its "Block AI Bots" toggle in 2024, available on all plans. It maintains a list of AI crawlers and blocks them at the edge, before they reach your origin. More granular controls (rate limiting, conditional blocking per User-Agent, custom firewall rules) are available on Pro and Business plans. Many sites use Cloudflare's bot management to allow Western AI bots and block Bytespider and other aggressive crawlers.
Vercel Firewall
Vercel's edge firewall lets you write custom rules matching User-Agent strings and apply allow/deny logic. For Next.js sites hosted on Vercel, this is the most ergonomic way to enforce AI bot policy.
Fastly and AWS CloudFront
Both support custom edge functions (Compute@Edge, Lambda@Edge) that can inspect User-Agent and apply per-bot logic. Higher engineering overhead than Cloudflare's toggle, but full flexibility.
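Each edge runtime has its own API, which is not reproduced here. As a stand-in, the per-bot decision logic can be sketched as a Python WSGI middleware running at the origin; an edge function would make the same User-Agent check earlier in the request path. The blocked list and class name are illustrative assumptions.

```python
BLOCKED_BOTS = ("bytespider", "diffbot")  # example policy, adjust to taste


def is_blocked(user_agent, blocked=BLOCKED_BOTS):
    """Case-insensitive substring match against the blocked bot tokens."""
    lowered = (user_agent or "").lower()
    return any(token in lowered for token in blocked)


class AIBotFilter:
    """WSGI middleware that answers 403 for blocked bots, else passes through."""

    def __init__(self, app, blocked=BLOCKED_BOTS):
        self.app = app
        self.blocked = blocked

    def __call__(self, environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        if is_blocked(user_agent, self.blocked):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return self.app(environ, start_response)


print(is_blocked("Mozilla/5.0 (compatible; Bytespider)"))  # True
print(is_blocked("Mozilla/5.0 (compatible; GPTBot/1.0)"))  # False
```

Because User-Agent strings can be spoofed in both directions, a string match like this is best paired with IP verification for any bot you treat as trusted.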
Schema Markup That Helps AI Crawlers
AI crawlers heavily favor structured data because it removes parsing ambiguity. The schemas that deliver the most AEO impact in 2026:
- FAQPage schema: AI models lift FAQ answers verbatim when responding to natural-language queries. Add this to every page with a meaningful FAQ section
- HowTo schema: For step-by-step content, HowTo schema gives AI models extractable instruction sets
- Product schema: Essential for SaaS pricing pages, e-commerce listings, and software comparison content
- Organization schema: Establishes your brand's entity identity — name, logo, sameAs links, founder, founding date
- Article schema: Surface your authors, publication dates, and topical relevance for AI citation
- BreadcrumbList schema: Helps AI models understand your site hierarchy and surface the right depth of content
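As an example of the first item, FAQPage markup can be generated from question and answer pairs rather than written by hand. The sketch below uses schema.org's FAQPage, Question, and Answer types; the helper names and the sample question are illustrative.

```python
import json


def faq_jsonld(pairs):
    """Build a schema.org FAQPage object from (question, answer) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }


def as_script_tag(data):
    """Serialize to a JSON-LD script tag, escaping '</' so the tag cannot close early."""
    body = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{body}\n</script>'


faq = faq_jsonld([
    ("Does blocking GPTBot affect Google Search?",
     "No. GPTBot is OpenAI's training crawler and is unrelated to Googlebot."),
])
print(as_script_tag(faq))
```

Generating the markup from the same data that renders the visible FAQ keeps the two from drifting apart.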
Common AI Crawler Configuration Mistakes
These are the recurring mistakes that quietly cost teams AI visibility. Audit your robots.txt today against this list.
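- Blocking GPTBot but allowing ChatGPT-User without meaning to: Your content cannot be trained on, but ChatGPT users can still pull it on demand. That split is a valid allow-search, block-training policy when chosen deliberately and a mistake when nobody decided it. Make sure it is intentional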
- Forgetting Google-Extended: Many sites assume their Googlebot rules cover Gemini training. They do not. Google-Extended is a separate robots.txt token, so Gemini training access has to be set with its own rule
- Blocking CCBot without realizing the downstream impact: Common Crawl feeds the training data of nearly every major foundation model. Blocking CCBot is one of the highest-impact AEO blocks you can make — sometimes intended, often accidental
- Using a single Disallow: / for all User-agents: The blanket block. Some teams inherit this from old configurations and don't realize they are invisible to every AI engine
- Not publishing a sitemap.xml: AI crawlers use sitemaps to discover content efficiently. No sitemap means slower, less complete crawling
- Forgetting to update robots.txt as new crawlers emerge: Anthropic added claude-web mid-2024. OpenAI added OAI-SearchBot. Apple added Applebot-Extended. If your robots.txt was last updated in 2023, you have crawler gaps
- Blocking AI crawlers via CDN while leaving robots.txt open: Mixed signals confuse monitoring tools. Be consistent across robots.txt and CDN policies
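Several of the checks in this list can be automated. The sketch below audits a robots.txt body for blocked crawlers, crawlers that are never named, a blanket block, and a missing Sitemap line. The function names, the crawler list, and the probe agent name are assumptions for illustration, and the parser is Python's standard-library one, whose matching may differ slightly from any individual crawler's.

```python
from urllib.robotparser import RobotFileParser

EXPECTED_BOTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "claude-web",
    "PerplexityBot", "Google-Extended", "Applebot-Extended", "CCBot",
]


def declared_agents(robots_txt):
    """Collect every User-agent value named in the file, lowercased."""
    agents = set()
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if line.lower().startswith("user-agent:"):
            agents.add(line.split(":", 1)[1].strip().lower())
    return agents


def audit_robots_txt(robots_txt, bots, url="https://yourdomain.com/"):
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    declared = declared_agents(robots_txt)
    return {
        "blocked": [bot for bot in bots if not parser.can_fetch(bot, url)],
        "unlisted": [bot for bot in bots if bot.lower() not in declared],
        # A wildcard group that blocks an agent nobody has named.
        "blanket_block": "*" in declared
        and not parser.can_fetch("unlisted-probe-agent", url),
        "has_sitemap": parser.site_maps() is not None,
    }


LEGACY_ROBOTS_TXT = "User-agent: *\nDisallow: /\n"  # inherited blanket block
print(audit_robots_txt(LEGACY_ROBOTS_TXT, EXPECTED_BOTS))
```

Running an audit like this in CI whenever robots.txt changes is one way to catch an accidental blanket block before it ships.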
Monitoring AI Crawler Activity at Scale
For brands serious about AEO, ongoing AI crawler monitoring is essential. The standard practice in 2026 is to maintain a dashboard that surfaces:
- Daily request counts per crawler
- Crawler activity per content section (marketing, docs, pricing, blog)
- Crawl error rates per bot
- New bots appearing in your logs (early warning of new AI products)
- Correlation between crawler activity and AI citation lift
The simplest implementation: pipe your CDN or web server logs to a data warehouse (BigQuery, Snowflake, ClickHouse), then build a dashboard in Metabase or Looker. The more sophisticated implementation: use a purpose-built AEO platform that combines crawler monitoring with AI visibility tracking, so you can directly correlate "crawler visited page X" with "AI mentions us in queries about X."
What's Coming in 2027
Three trends are accelerating that technical teams should prepare for.
First, more granular AI crawler controls are emerging. Beyond simple allow/disallow, providers are introducing per-purpose directives (training-only, search-only, summarization-only) that let publishers opt into specific use cases. Expect Apple, OpenAI, and Anthropic to lead this evolution.
Second, compensation models for AI crawling are forming. Cloudflare's "Pay Per Crawl" proposal and ProRata's licensing model suggest publishers may soon be able to charge AI crawlers for access. Robots.txt will evolve to include pricing directives alongside allow/disallow.
Third, verification standards are tightening. Spoofed User-Agents pretending to be GPTBot are increasingly common. Expect cryptographic verification (signed crawler requests, verified IP ranges) to become standard, with non-verified bots auto-blocked at the CDN edge.
The Bottom Line: AI Crawler Configuration Is a Living System
The brands that win AI visibility in 2026 and 2027 are the ones treating AI crawler configuration as an ongoing technical discipline — not a one-time setup. They audit robots.txt quarterly, monitor crawler logs weekly, update llms.txt with every major positioning change, and integrate IndexNow into their content publishing pipeline. They make conscious decisions about which crawlers to allow per business context, and they measure the AI visibility outcomes that follow.
Sourceable closes the feedback loop between AI crawler configuration and AI visibility outcomes. We monitor your brand across ChatGPT, Claude, Gemini, Perplexity, and Meta AI — so when you allow GPTBot or update your llms.txt or push IndexNow notifications, you can see the citation lift in real time. We surface which prompts you appear in, which competitors are cited where you should be, and which AI engines are pulling from your content. The robots.txt + llms.txt + IndexNow stack is the technical foundation. Sourceable is the measurement layer that proves it is working.
Start with a free AI Visibility Report. Audit your current AI footprint, see which bots are crawling you, and discover where the biggest visibility gaps are. The 2026 AI crawler stack is ready. The question is whether your brand is configured to be cited by it — or quietly invisible to half the search engines your buyers now use.