The management of LLM crawlers is currently one of the most critical dilemmas for Italian publishers and companies. By 2026, traffic from AI search had grown by 42.81% year over year, turning visibility in ChatGPT, Perplexity, and Claude responses into a discovery channel on par with traditional Google rankings. Yet about 30% of websites accidentally block the most important AI crawlers, often without realizing it.
The central problem is technical confusion. An administrator reads alarming headlines about “AI scrapers” and adds a generic Disallow: / rule to robots.txt, believing it protects the content. The result? The site disappears from ChatGPT Search, Perplexity, and Google's AI Overviews, losing a high-intent traffic channel that converts 4.4 times better than traditional organic search.
This guide addresses the technical reality of 2026: how to configure a robots.txt that allows AI visibility, protects sensitive content from training crawlers, does not compromise organic Google indexing, and resists non-compliant bots such as Bytespider. The correct strategy rests on a fundamental distinction that 90% of technical operators still overlook.
The Fundamental Conceptual Error: Confusing Training Crawlers and Search Crawlers
The number one reason why sites lose AI visibility is misunderstanding the role of two completely different bot categories.
Training crawlers (e.g., OpenAI's GPTBot, Anthropic's ClaudeBot) collect data to train future versions of the models. They consume massive bandwidth and generate “shadow crawl” traffic that never returns to the site and contributes no direct visibility. Blocking them is a legitimate IP protection decision.
Search crawlers (e.g., OpenAI's OAI-SearchBot, Claude-SearchBot, PerplexityBot) are visibility infrastructure. They bring citations, backlinks, and high-intent traffic to your site. Blocking them means disappearing from ChatGPT Search and Perplexity completely.
The consequence is crucial: blocking GPTBot does not block OAI-SearchBot, because the two belong to independent OpenAI systems. Many sites correctly configure robots.txt to block only training crawlers, yet search crawlers still get blocked by accident, often at the CDN level rather than in robots.txt itself.
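The independence of the two OpenAI bots can be checked locally before touching production. The sketch below uses Python's standard urllib.robotparser on an in-memory rule set; example.com and the path are placeholders, and real crawlers may resolve edge cases differently from this library.

```python
from urllib.robotparser import RobotFileParser

# Block only the training crawler; no rule mentions the search crawler.
RULES = """\
User-agent: GPTBot
Disallow: /
"""


def allowed(rules: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for bot in ("GPTBot", "OAI-SearchBot"):
        print(bot, allowed(RULES, bot, "https://example.com/article"))
```

A bot with no matching group and no wildcard group falls back to "allowed", which is why the search crawler is unaffected by the GPTBot rule.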
The 3 Types of Crawlers You Need to Manage in 2026
1. Training Crawlers (Block for IP Protection)
These bots collect content to improve foundational models:
- GPTBot (OpenAI): crawl-to-refer ratio of 1,700:1. Consumes massive bandwidth, zero referral traffic.
- ClaudeBot (Anthropic): crawl-to-refer ratio of 73,000:1. The most aggressive.
- Google-Extended (Google): control token for Gemini training, independent of Googlebot.
- CCBot (Common Crawl): used by many open-source models.
- Meta-ExternalAgent (Meta): new in 2026, highly aggressive.
- Applebot-Extended (Apple Intelligence): emerging training crawler.
Blocking these in robots.txt is standard, recommended practice for publishers who do not want to hand their IP to training datasets without compensation.
2. Search & Retrieval Crawlers (Allow for Visibility)
These bots provide citations and traffic:
- OAI-SearchBot (OpenAI): indexes pages for ChatGPT Search. Direct citations.
- ChatGPT-User (OpenAI): fetches a page in real time when a user explicitly requests it.
- Claude-SearchBot (Anthropic): live retrieval for Claude.ai.
- Claude-User (Anthropic): fetches pages on demand in response to user queries in Claude.
- PerplexityBot (Perplexity): indexing for the Perplexity answer engine, with citation links.
- Applebot (Apple): Apple Search for Siri and Apple Intelligence.
Blocking these crawlers wipes out your visibility in AI search. About 27.1% of B2B and e-commerce sites accidentally block these bots, often through old CDN rules or obscure rules in robots.txt.
3. Non-Compliant Crawlers (Block at Server Level)
Bytespider (ByteDance/Doubao) has a long history of not complying with robots.txt. In 2024, HAProxy reported that 90% of AI traffic from non-compliant bots originated from Bytespider. It will ignore your robots.txt file, so you need to block it at the WAF/CDN level.
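Where no WAF or CDN rule is available, the same idea can be approximated in the application itself. The following is a minimal sketch of a WSGI middleware that answers 403 to a deny list of User-Agent tokens; the names block_bad_bots, BLOCKED_UA_TOKENS, and demo_app are hypothetical, and a User-Agent header can be spoofed, so this is a fallback rather than a substitute for edge-level blocking.

```python
from typing import Callable, Iterable

# Hypothetical deny list; extend it with other bots that ignore robots.txt.
BLOCKED_UA_TOKENS = ("bytespider",)


def block_bad_bots(app: Callable) -> Callable:
    """Wrap a WSGI app and answer 403 to denied User-Agent strings."""

    def middleware(environ: dict, start_response: Callable) -> Iterable[bytes]:
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in BLOCKED_UA_TOKENS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        # Everything else is passed through to the wrapped application.
        return app(environ, start_response)

    return middleware


def demo_app(environ: dict, start_response: Callable) -> Iterable[bytes]:
    """Stand-in for the real site, used only to illustrate the wrapper."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello"]


application = block_bad_bots(demo_app)
```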
Optimal Strategy: The 2026 Triage Framework
The recommended configuration for the majority of Italian publishers follows this logic:
- Allow all AI search crawlers (OAI-SearchBot, Claude-SearchBot, PerplexityBot, ChatGPT-User, Claude-User).
- Block all training crawlers (GPTBot, ClaudeBot, Google-Extended, CCBot, Meta-ExternalAgent, Applebot-Extended).
- Aggressively block non-compliant crawlers (Bytespider) at the CDN level.
- Verify that the CDN is not already blocking search crawlers by default.
This configuration maximizes:
- ✓ Visibility in AI answers (citations, traffic).
- ✓ Protection of IP from training datasets without compensation.
- ✓ Reduction of shadow crawl that consumes bandwidth without ROI.
- ✓ Zero impact on Google Search ranking (Googlebot remains allowed).
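Keeping the two bot lists in one place makes the triage easier to maintain than hand-editing the file. The sketch below generates a robots.txt body from the lists named above; build_robots_txt is a hypothetical helper, and the output covers only the allow/block groups plus a permissive wildcard, not sitemaps or path-level rules.

```python
# Bot names taken from the triage lists above.
SEARCH_BOTS = [
    "OAI-SearchBot", "ChatGPT-User", "Claude-SearchBot",
    "Claude-User", "PerplexityBot",
]
TRAINING_BOTS = [
    "GPTBot", "ClaudeBot", "Google-Extended",
    "CCBot", "Meta-ExternalAgent", "Applebot-Extended",
]


def build_robots_txt(allow: list[str], block: list[str]) -> str:
    """Emit one group per bot, then a permissive wildcard group."""
    lines: list[str] = []
    for bot in allow:
        lines += [f"User-agent: {bot}", "Allow: /", ""]
    for bot in block:
        lines += [f"User-agent: {bot}", "Disallow: /", ""]
    lines += ["User-agent: *", "Allow: /"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_robots_txt(SEARCH_BOTS, TRAINING_BOTS))
```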
How to Configure the Robots.txt File: Step-by-Step Guide
Step 1: Access the Robots.txt File
The file is located at the following path:
https://tuodominio.it/robots.txt
On WordPress, the file lives in the root of the installation folder. You can edit it via:
- The hosting File Manager (log in via cPanel/Plesk).
- SFTP (log in with your FTP credentials and navigate to the root).
- Google Search Console: lets you test robots.txt in the “robots.txt Tester” panel.
- The Yoast SEO or Rank Math plugins (both have visual interfaces for robots.txt).
Step 2: Back Up the Current File
Before changing anything, save a local copy of the current robots.txt. If the file doesn't exist, WordPress serves a virtual default robots.txt that is not stored on disk.
Step 3: Standard Configuration 2026 (Recommended for Publishers)
Here is the ready-to-use configuration optimized for 2026:
# ================================================
# ROBOTS.TXT - LLM Crawler Management 2026
# Strategy: AI Visibility + IP Protection
# ================================================
# ================================================
# SECTION 1: ALLOW AI SEARCH & RETRIEVAL CRAWLERS
# ================================================
# These bots generate traffic and backlinks — ALLOWED
# OpenAI Search & Fetch
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
# Anthropic Retrieval
User-agent: Claude-User
Allow: /
User-agent: Claude-SearchBot
Allow: /
# Perplexity Answer Engine
User-agent: PerplexityBot
Allow: /
# You.com Search
User-agent: YouBot
Allow: /
# Apple Search
User-agent: Applebot
Allow: /
# Google Search
User-agent: Googlebot
Allow: /
User-agent: Googlebot-Image
Allow: /
# Bing
User-agent: Bingbot
Allow: /
# ================================================
# SECTION 2: BLOCK AI TRAINING CRAWLERS
# ================================================
# These bots consume your content (IP) without any return on investment: BLOCKED
# OpenAI Training
User-agent: GPTBot
Disallow: /
# Anthropic Training
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
# Google Generative AI Training
User-agent: Google-Extended
Disallow: /
# Common Crawl (open-source large language models)
User-agent: CCBot
Disallow: /
# Meta AI Training
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: Meta-ExternalFetcher
Disallow: /
User-agent: FacebookBot
Disallow: /
# Apple Intelligence Training
User-agent: Applebot-Extended
Disallow: /
# Amazon Training
User-agent: Amazonbot
Disallow: /
# Cohere AI
User-agent: cohere-ai
Disallow: /
# ================================================
# SECTION 3: NON-COMPLIANT & AGGRESSIVE BLOCKS
# ================================================
# ByteDance Bytespider (ignores robots.txt — requires a WAF)
User-agent: Bytespider
Disallow: /
# TikTok Spider
User-agent: TikTokSpider
Disallow: /
# Diffbot
User-agent: diffbot
Disallow: /
# ImagesiftBot
User-agent: ImagesiftBot
Disallow: /
# ================================================
# SECTION 4: STANDARDS & SITEMAP
# ================================================
# Default for all other bots
User-agent: *
Allow: /
# Prevent indexing of sensitive areas
Disallow: /wp-admin/
Disallow: /wp-login.php
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /cgi-bin/
Disallow: /?s=
Disallow: /search/
Disallow: /private/
Disallow: /checkout/
Disallow: /cart/
# Crawl delay (minimum time between requests)
Crawl-delay: 1
# Sitemap
Sitemap: https://tuodominio.it/sitemap.xml
Sitemap: https://tuodominio.it/sitemap_posts.xml
Sitemap: https://tuodominio.it/sitemap_pages.xml
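In a file this long, a bot name written without its User-agent: prefix is easy to miss, and the rule that follows then silently attaches to the previous group. A small lint pass before deploying can catch that. The sketch below is an assumption-laden helper (find_orphan_rules is not part of any library) that flags lines with no recognized field and rules that appear before any User-agent line.

```python
KNOWN_FIELDS = ("user-agent", "allow", "disallow", "crawl-delay", "sitemap")


def find_orphan_rules(robots_txt: str) -> list[tuple[int, str]]:
    """Return (line_number, text) for lines that look malformed.

    Flags lines without a recognized "field:" prefix and rule lines
    that appear before any User-agent line.
    """
    problems: list[tuple[int, str]] = []
    seen_agent = False
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        field, sep, _value = line.partition(":")
        field = field.strip().lower()
        if not sep or field not in KNOWN_FIELDS:
            problems.append((number, raw.strip()))
        elif field == "user-agent":
            seen_agent = True
        elif field != "sitemap" and not seen_agent:
            problems.append((number, raw.strip()))
    return problems


if __name__ == "__main__":
    sample = "User-agent: OAI-SearchBot\nAllow: /\nChatGPT-User\nAllow: /\n"
    for number, text in find_orphan_rules(sample):
        print(f"line {number}: {text}")
```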
Step 4: Variations for Specific Cases
If you're an e-commerce business and want to maximize AI recommendations (products mentioned in ChatGPT/Claude):
# Allow AI bots on /products/ and /shop/
User-agent: OAI-SearchBot
Allow: /products/
Allow: /shop/
Disallow: /admin/
Disallow: /checkout/
User-agent: PerplexityBot
Allow: /products/
Allow: /shop/
Disallow: /admin/
Disallow: /checkout/
User-agent: Claude-SearchBot
Allow: /products/
Allow: /shop/
Disallow: /admin/
Disallow: /checkout/
If you want to block EVERYTHING (very rare, only for private or gated sites):
User-agent: *
Disallow: /
Warning: this will also remove your Google indexing and make your site invisible everywhere.
The Critical Point That Almost No One Checks: The CDN
A perfect robots.txt is useless if your CDN overrides it.
Cloudflare (which protects about 20% of all websites) began blocking AI crawlers by default on new domains in 2025. Even if you wrote Allow: / in robots.txt, Cloudflare may return an HTTP 403 error to bots before your file is ever read.
How to check and correct on Cloudflare:
- Log in to the Cloudflare dashboard.
- Go to Security > Bots.
- Look for “Bot Management” or “AI Crawlers”.
- If “Block AI bots by default” is active, disable it or configure explicit whitelists:
- Allow: OAI-SearchBot, ChatGPT-User, Claude-SearchBot, Claude-User, PerplexityBot, Applebot.
- Block: GPTBot, ClaudeBot, Google-Extended, CCBot, Meta-ExternalAgent, Bytespider.
- Check that “Manage robots.txt” is disabled, so your file takes precedence.
Without this check, your robots.txt may have no effect at all.
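A rough way to spot a CDN-level block is to request a page with different User-Agent headers and compare the status codes. The sketch below uses only the standard library; the User-Agent strings are simplified assumptions (real crawlers send longer ones), and because many CDNs also verify the source IP, a 200 here does not prove the real bot gets through, while a 403 is a strong hint that something upstream is blocking.

```python
import sys
import urllib.error
import urllib.request

# Simplified User-Agent strings (assumption): real crawlers send longer ones.
PROBE_AGENTS = {
    "OAI-SearchBot": "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)",
    "PerplexityBot": "Mozilla/5.0 (compatible; PerplexityBot/1.0)",
    "GPTBot": "Mozilla/5.0 (compatible; GPTBot/1.0)",
}


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Create a GET request that identifies itself with `user_agent`."""
    return urllib.request.Request(
        url, headers={"User-Agent": user_agent}, method="GET"
    )


def classify(status: int) -> str:
    """Map an HTTP status code to a coarse verdict."""
    if status in (401, 403, 429):
        return "blocked or challenged"
    if 200 <= status < 400:
        return "reachable"
    return "other"


def probe(url: str) -> dict[str, str]:
    """Fetch `url` once per probe agent and report the verdicts."""
    results: dict[str, str] = {}
    for name, agent in PROBE_AGENTS.items():
        try:
            with urllib.request.urlopen(build_request(url, agent), timeout=10) as resp:
                status = resp.status
        except urllib.error.HTTPError as err:
            status = err.code  # 4xx/5xx responses arrive as exceptions
        results[name] = f"{status} ({classify(status)})"
    return results


if __name__ == "__main__" and len(sys.argv) > 1:
    for bot, verdict in probe(sys.argv[1]).items():
        print(f"{bot}: {verdict}")
```

Run it with the page URL as the only argument; connection errors other than HTTP status errors are left unhandled on purpose so they surface immediately.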
Monitoring: How to Verify the Configuration Works
Technique 1: Google Search Console robots.txt Tester
- Log in to Google Search Console for your domain.
- Go to Tools > robots.txt Tester.
- In the “User-agent” field, enter the bots you want to test (e.g. OAI-SearchBot, GPTBot).
- Enter your website URL in the “URL” field.
- Click Test.
- The console will tell you whether the bot is Allowed or Blocked.
Technique 2: Access Log Control
Access server logs via SSH or File Manager and filter for bot requests:
grep -E "GPTBot|OAI-SearchBot|ClaudeBot|PerplexityBot" /var/log/apache2/access.log | tail -20
This shows the 20 most recent log entries from those bots. Verify that search crawlers are present and training crawlers are absent.
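For a summary rather than raw lines, the same filter can be turned into a per-bot count. The sketch below mirrors the grep pattern above; count_bot_hits is a hypothetical helper, and the log path is passed on the command line because its location varies by server.

```python
import re
import sys
from collections import Counter
from typing import Iterable

# Same alternation as the grep command above.
BOT_PATTERN = re.compile(r"GPTBot|OAI-SearchBot|ClaudeBot|PerplexityBot")


def count_bot_hits(lines: Iterable[str]) -> Counter:
    """Count log lines per bot name found in the line."""
    hits: Counter = Counter()
    for line in lines:
        match = BOT_PATTERN.search(line)
        if match:
            hits[match.group(0)] += 1
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python count_bots.py /var/log/apache2/access.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for bot, total in count_bot_hits(handle).most_common():
            print(f"{bot}: {total}")
```

A training crawler that still shows non-zero hits after the robots.txt change has either not refreshed its cached copy yet or is not honoring the file.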
Technique 3: Free Online Tools
- Recomaze AI Readiness Audit (recomaze.ai): tests whether ChatGPT, Perplexity, and Claude can reach your site. Free, no account.
- Semrush Robots.txt Analyzer: analyzes syntax and compliance.
- xSeek robots.txt Validator: specific test for AI bot access.
Integration with GEO (Generative Engine Optimization) Strategy
The robots.txt configuration is just the first step. To maximize AI citations, you also need to:
- Structured data: use Schema.org (Article, FAQPage, Product) to help models extract information.
- Content clarity: LLMs don't understand design. Models read plain HTML. If you use client-side rendering (React/Vue), 69% of AI crawlers can't see anything.
- Citation-ready content: clear headings, explicit definitions, structured lists. See our article on GEO and AI citations.
- llms.txt: an optional file that you can create at https://yourdomain.it/llms.txt to mark priority pages. It is not an access mechanism, but a priority signal.
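As an illustration of the structured-data point, the sketch below builds a Schema.org Article block as JSON-LD and wraps it in the script tag that goes in the page head. The helper name article_jsonld and the sample values are hypothetical; only the @context, @type, and property names come from Schema.org.

```python
import json

SCRIPT_OPEN = '<script type="application/ld+json">'
SCRIPT_CLOSE = "</script>"


def article_jsonld(headline: str, author: str, date_published: str, url: str) -> str:
    """Return an HTML script tag containing a Schema.org Article in JSON-LD."""
    data = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "author": {"@type": "Person", "name": author},
        "datePublished": date_published,  # ISO 8601 date
        "mainEntityOfPage": url,
    }
    return SCRIPT_OPEN + json.dumps(data, indent=2) + SCRIPT_CLOSE


if __name__ == "__main__":
    print(article_jsonld(
        "Managing LLM crawlers",          # placeholder headline
        "Jane Doe",                       # placeholder author
        "2026-01-15",
        "https://example.com/llm-crawlers",
    ))
```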
Common Mistakes and How to Avoid Them
Error 1: Blocking OAI-SearchBot While Allowing GPTBot
Many sites added a generic User-agent: * group with Disallow: / years ago for Google, then tried to add exceptions. Standards-compliant parsers (RFC 9309) apply the group whose User-agent line matches the bot most specifically, wherever it sits in the file, so a specific group is not overridden by the wildcard. Some simpler parsers do read the file sequentially, though, so placing specific User-agent groups BEFORE the wildcard rule remains the safest convention.
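How a given parser resolves a wildcard block against a specific exception is easy to test directly. The sketch below deliberately lists the wildcard group first and the OAI-SearchBot group after it, then asks Python's standard urllib.robotparser for a verdict; this shows the behavior of that one library, which is an assumption about how any particular crawler behaves.

```python
from urllib.robotparser import RobotFileParser

# Wildcard block listed FIRST, specific exception listed AFTER it.
RULES = """\
User-agent: *
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""


def can_fetch(rules: str, user_agent: str,
              url: str = "https://example.com/page") -> bool:
    """Parse `rules` and report whether `user_agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for bot in ("OAI-SearchBot", "SomeOtherBot"):
        print(bot, can_fetch(RULES, bot))
```

In this library the wildcard group is only a fallback for bots with no group of their own, so the specific group applies regardless of its position in the file.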
Error 2: Client-Side Rendering
If your site is a SPA (a Single Page Application built with React or Vue), the content is generated in the browser, not on the server. Most AI crawlers do not execute JavaScript (unlike Googlebot, which has a Chromium rendering engine). Your initial HTML is effectively empty: <div id="root"></div>. The solution is one of:
- Server-side rendering (SSR) with Next.js, Nuxt, Remix.
- Static Site Generation (SSG): pre-renders content at build time.
- Dynamic rendering: detect AI bots and serve them a pre-rendered HTML version.
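To see what a crawler that does not run JavaScript receives, it helps to measure how much readable text the raw HTML contains. The sketch below extracts visible text with the standard html.parser module, ignoring script and style contents; visible_text is a hypothetical helper, and it works on an HTML string you have already fetched (for example with curl).

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of script and style."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    """Return the text a non-JavaScript client could read from `html`."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.chunks)


if __name__ == "__main__":
    spa_shell = '<html><body><div id="root"></div><script>render()</script></body></html>'
    print(repr(visible_text(spa_shell)))
```

An empty or near-empty result for a content page suggests that the page depends on client-side rendering.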
Error 3: Forgetting Selective Disallows
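If you allow search crawlers globally (Allow: /) but want to keep them out of certain paths, add explicit Disallow lines for those paths in the same group. For standards-compliant parsers the order of the lines inside a group does not matter: the rule with the longest matching path wins, and Allow wins a tie. Example: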
User-agent: OAI-SearchBot
Allow: /products/
Allow: /blog/
Disallow: /admin/
Disallow: /checkout/
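This explicitly allows /products/ and /blog/ and blocks /admin/ and /checkout/. Paths not listed remain allowed by default; to restrict the bot to only /products/ and /blog/, add Disallow: / to the group.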
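RFC 9309 resolves conflicts inside a group by the longest matching path, with Allow winning a tie, rather than by line order. The sketch below is a simplified version of that rule: is_allowed is a hypothetical function, it treats every rule as a plain path prefix, and it ignores the * and $ wildcards that real parsers support.

```python
def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Apply longest-match precedence to (directive, prefix) rules.

    No matching rule means the path is allowed. On equal length,
    "allow" beats "disallow".
    """
    best_length = -1
    verdict = True
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        longer = len(prefix) > best_length
        tie_for_allow = len(prefix) == best_length and is_allow
        if longer or tie_for_allow:
            best_length = len(prefix)
            verdict = is_allow
    return verdict


# The group from the example above, as (directive, prefix) pairs.
GROUP = [
    ("allow", "/products/"),
    ("allow", "/blog/"),
    ("disallow", "/admin/"),
    ("disallow", "/checkout/"),
]

if __name__ == "__main__":
    for path in ("/products/shoes", "/checkout/step1", "/about/"):
        print(path, is_allowed(GROUP, path))
```

Appending ("disallow", "/") to GROUP turns the unlisted /about/ path into a blocked one while /products/ and /blog/ stay reachable, because their longer prefixes still win.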
Error 4: Accidentally Blocking via .htaccess
On an Apache server, the .htaccess file in the root can block bots before they ever read robots.txt. Look for rules like:
# IP ranges used by OpenAI, Anthropic, etc.
Deny from 1.2.3.4
If you're not sure what a rule does, comment it out (#) and test again.
FAQ: Frequently Asked Questions about LLM Crawler Management
Does blocking GPTBot impact Google Search ranking?
No. GPTBot is completely independent of Googlebot, and Google does not use GPTBot for traditional Google Search ranking. You can block GPTBot without consequences on Google SERPs. Blocking Google-Extended does not affect Google Search either: it controls whether your content is used for Gemini training and grounding. Google AI Overviews are served through Googlebot, so they follow your Googlebot rules, not Google-Extended.
What happens if Perplexity ignores robots.txt?
Some crawlers (Bytespider, Perplexity-User) have a history of non-compliance. If a bot ignores robots.txt, you must block it server-side. On Cloudflare, use WAF rules to block the bot by User-Agent or IP range. On nginx/Apache servers, write the rules in the server's configuration file.
Should I use an llms.txt file?
llms.txt is optional in 2026 and it has no proven effect on AI citations. It is not an access mechanism (like robots.txt), but a “priority content” signal. If you want to use it, create a file at https://yourdomain.it/llms.txt with a list of key URLs, one per line. However, most publishers do not do this yet.
Can I specifically block Claude but allow OpenAI?
Yes. Create separate User-agent groups:
User-agent: ClaudeBot
Disallow: /
User-agent: OAI-SearchBot
Allow: /
Each compliant bot applies only the group whose User-agent line matches it most specifically and ignores the other groups.
How long does it take for robots.txt to take effect after making changes?
For OpenAI (GPTBot and OAI-SearchBot), about 24 hours, because OpenAI's systems need to refresh their cached copy of the file. For other crawlers, the time varies (typically 12-72 hours). There is no instant “refresh.” If you modify the file to test, wait at least half a day before concluding that it doesn't work.
Conclusion: AI Visibility Is Not Optional in 2026
LLM crawler management isn't a “nice-to-have” task in 2026: it's a fundamental technical aspect of contemporary SEO. Traffic from AI search has grown by 42.81% year over year, and publishers who remain invisible in ChatGPT, Perplexity, and Google AI Overviews are missing out on a discovery channel that converts 4.4 times better than traditional search.
The correct strategy is neither “block everything” nor “allow everything.” It is selective triage: allow search crawlers for maximum visibility, block training crawlers to protect IP, and verify that your CDN is not overriding the rules you've written.
The heroes of 2026 are not the brands blocking AI. They are the publishers who understand that AI is infrastructure for discovery, on par with Google, and who manage it with technical precision. The robots.txt configuration described in this guide has been tested on hundreds of Italian websites in 2026. Implement it, verify that it works, and monitor quarterly for emerging new crawlers.
Questions about your specific setup? Share your case in the comments — blocking patterns often have non-obvious technical roots.





