AI-Crawler robots.txt Builder

Generate a robots.txt that allows or blocks AI crawlers — GPTBot, ClaudeBot, Google-Extended, PerplexityBot and more. Free, runs in your browser.

How to Use the AI-Crawler robots.txt Builder

Choose which AI crawlers to block

Flip the toggle next to any crawler you want to keep out — GPTBot, ClaudeBot, Google-Extended, PerplexityBot and more are grouped by company. Use "Block all AI" or "Allow all AI" to set them in one click, then fine-tune.

Decide what happens to everyone else

Set whether all other crawlers — including search engines like Googlebot and Bingbot — are still allowed (recommended, so you stay searchable) or blocked too.

Copy or download

The file rebuilds live, in valid robots.txt syntax, as you change the toggles. Copy it to the clipboard or download it as robots.txt. Everything runs in your browser; nothing you choose is sent anywhere.
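As a rough illustration of what the generated file looks like, the Python sketch below assembles the same kind of output from a list of blocked tokens. The function name `build_robots_txt` and its parameters are hypothetical, chosen for this example; the builder itself runs as JavaScript in the browser and may assemble the file differently.

```python
def build_robots_txt(blocked_agents, allow_others=True):
    """Return robots.txt text that blocks the given user-agent tokens.

    Hypothetical helper for illustration only, not the builder's own code.
    """
    lines = []
    for agent in blocked_agents:
        # One group per blocked crawler: disallow the whole site.
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    # Catch-all group for every crawler not named above.
    lines.append("User-agent: *")
    # An empty Disallow value means nothing is disallowed.
    lines.append("Disallow:" if allow_others else "Disallow: /")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_robots_txt(["GPTBot", "ClaudeBot", "CCBot"]))
```

Passing `allow_others=False` corresponds to blocking everyone else as well, which would also keep search engines out.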

Publish it at your site root

Place the file so it is reachable at https://yoursite.com/robots.txt. If you already have a robots.txt, merge these User-agent blocks into it rather than overwriting your existing rules.
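If you prefer to script the merge, one possible approach is sketched below in Python. The helper `merge_blocks` and the sample `EXISTING` file are assumptions made up for this example; the sketch appends a group only for agents that do not already have one, and it does not handle inline comments on `User-agent` lines. The standard-library `urllib.robotparser` is used afterwards as a sanity check.

```python
from urllib.robotparser import RobotFileParser


def merge_blocks(existing, blocked_agents):
    """Append a 'Disallow: /' group for each agent not already named."""
    # Collect user-agent tokens that already have a group in the file.
    present = {
        line.split(":", 1)[1].strip().lower()
        for line in existing.splitlines()
        if line.strip().lower().startswith("user-agent:")
    }
    merged = existing.rstrip("\n").splitlines()
    for agent in blocked_agents:
        if agent.lower() in present:
            continue  # keep the site's existing rules for this agent
        if merged:
            merged.append("")  # blank line between groups
        merged += [f"User-agent: {agent}", "Disallow: /"]
    return "\n".join(merged) + "\n"


# Hypothetical existing file with one rule we want to preserve.
EXISTING = "User-agent: *\nDisallow: /admin/\n"
merged = merge_blocks(EXISTING, ["GPTBot", "CCBot"])

# Sanity check with the standard-library parser.
parser = RobotFileParser()
parser.parse(merged.splitlines())
print(parser.can_fetch("GPTBot", "/blog/post"))       # blocked by the new group
print(parser.can_fetch("Googlebot", "/blog/post"))    # still allowed
print(parser.can_fetch("Googlebot", "/admin/users"))  # existing rule preserved
```

Appending at the end is safe in this sketch because a crawler follows the group that names it and falls back to `*` only when no group does, so group order does not change the outcome.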

Controlling AI Crawlers with robots.txt

Why AI crawlers got their own user-agents

For decades, robots.txt was mostly about search engines — telling Googlebot and Bingbot which paths to skip. The generative-AI boom changed the question. Companies now run separate crawlers to gather training data and to answer user queries in real time, and many gave those crawlers distinct names so site owners could make a choice. GPTBot and ClaudeBot gather data that can improve models; OAI-SearchBot and Claude-SearchBot build the indexes behind AI search; ChatGPT-User and Claude-User fetch a page because a user asked the assistant about it. Naming them separately means you can, for example, block training while still allowing the user-triggered fetch that sends you referral traffic.
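To make the "block training, allow user-triggered fetches" case concrete, here is a small Python sketch that checks a sample file with the standard-library `urllib.robotparser`. The `ROBOTS_TXT` contents, the `is_allowed` helper, and the example URL are illustrative assumptions; real crawlers apply their own parsers, so treat this as a local sanity check rather than a guarantee of crawler behaviour.

```python
from urllib.robotparser import RobotFileParser

# Sample file: training crawlers blocked, everyone else allowed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Disallow:
"""


def is_allowed(robots_txt, agent, url="https://yoursite.com/article"):
    """Return True if `agent` may fetch `url` under the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "ClaudeBot", "ChatGPT-User", "Claude-User"):
        status = "allowed" if is_allowed(ROBOTS_TXT, agent) else "blocked"
        print(f"{agent}: {status}")
```

With this file the two training crawlers are reported as blocked, while the user-triggered agents fall through to the catch-all group and remain allowed.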

Two nuances matter here. Google-Extended and Applebot-Extended are not crawlers at all; they are opt-out tokens that exist only in robots.txt. Google and Apple still crawl your site with Googlebot and Applebot for search, and the "-Extended" token only signals whether that already-collected content may be used to train Gemini or Apple Intelligence. CCBot belongs to Common Crawl, an open dataset that a large number of AI models train on indirectly, so blocking it cuts off one of the biggest upstream training sources. This builder groups every major crawler by company and labels what each one actually does, so your choices are informed rather than guesswork.
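One way to picture the grouping is as a small catalogue keyed by purpose. The Python sketch below is an assumption about how such a table could be modelled, not the builder's actual data structure; the `Agent` class, the purpose labels, and `tokens_for` are hypothetical names, and the entries are limited to agents whose role is described on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    token: str
    company: str
    purpose: str  # "training", "search", "user", or "opt-out"


# Illustrative subset only; roles follow the descriptions on this page.
CATALOGUE = [
    Agent("GPTBot", "OpenAI", "training"),
    Agent("OAI-SearchBot", "OpenAI", "search"),
    Agent("ChatGPT-User", "OpenAI", "user"),
    Agent("ClaudeBot", "Anthropic", "training"),
    Agent("Claude-SearchBot", "Anthropic", "search"),
    Agent("Claude-User", "Anthropic", "user"),
    Agent("Google-Extended", "Google", "opt-out"),
    Agent("Applebot-Extended", "Apple", "opt-out"),
    Agent("Perplexity-User", "Perplexity", "user"),
    # Common Crawl feeds training indirectly via its open dataset.
    Agent("CCBot", "Common Crawl", "training"),
]


def tokens_for(purposes):
    """Return the tokens whose purpose is in `purposes`, in catalogue order."""
    return [agent.token for agent in CATALOGUE if agent.purpose in purposes]


if __name__ == "__main__":
    # Opt out of training while leaving search and user-triggered agents alone.
    print(tokens_for({"training", "opt-out"}))
```

Selecting by purpose rather than by name keeps a policy such as "no training, but stay reachable from assistants" readable as the list of tokens grows.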

"robots.txt is a request, not a wall. Reputable AI companies honour it; it is a sign on the door, not a lock."

What robots.txt can and can't do

The single most important thing to understand is that robots.txt is advisory. It is a published convention that well-behaved crawlers choose to obey — the major AI companies document that they respect it — but it does not technically prevent anyone from reading your pages. Bad actors and some aggressive scrapers ignore it entirely. If you need real enforcement, that comes from server-side controls: blocking by user-agent or IP at your web server or CDN, rate limiting, or authentication. Think of robots.txt as the front line of a layered approach — the polite, standards-based signal that handles the reputable crawlers cleanly, leaving harder measures for the ones that don't play fair. Pick your crawlers, publish the file, and pair it with firewall rules if your content is sensitive.
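For the server-side layer, a minimal sketch of user-agent blocking is shown below as Python WSGI middleware. The names `BLOCKED_TOKENS`, `is_blocked`, and `block_ai_crawlers` are hypothetical, and the token list is only an example. User-agent strings can be spoofed, so a check like this is best treated as one layer alongside IP rules or rate limiting at the web server or CDN.

```python
# Example tokens only; adjust to the crawlers you actually want to refuse.
BLOCKED_TOKENS = ("GPTBot", "ClaudeBot", "CCBot", "Bytespider")


def is_blocked(user_agent, blocked=BLOCKED_TOKENS):
    """Case-insensitive substring match against the User-Agent header."""
    ua = (user_agent or "").lower()
    return any(token.lower() in ua for token in blocked)


def block_ai_crawlers(app, blocked=BLOCKED_TOKENS):
    """Wrap a WSGI app and answer 403 when the User-Agent is blocked."""
    def middleware(environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT"), blocked):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Everyone else reaches the wrapped application unchanged.
        return app(environ, start_response)
    return middleware


def demo_app(environ, start_response):
    """Tiny placeholder application used only for this example."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]


application = block_ai_crawlers(demo_app)
```

Unlike robots.txt, this refuses the request itself, so it also applies to crawlers that never read or respect the file, as long as they announce a matching user-agent.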

10 Facts About AI Crawlers & robots.txt

01. robots.txt is advisory: reputable crawlers honour it, but it does not technically block access.
02. GPTBot (OpenAI) and ClaudeBot (Anthropic) gather data used to train and improve their models.
03. Google-Extended and Applebot-Extended are training opt-out tokens, not separate crawlers.
04. CCBot is Common Crawl's crawler; its open dataset is one a huge number of AI models train on indirectly.
05. User-triggered agents like ChatGPT-User and Perplexity-User fetch a page because a person asked about it.
06. Blocking training crawlers but allowing search/user agents lets you stay discoverable while opting out of training.
07. Bytespider (ByteDance) is widely reported as one of the most aggressive AI scrapers on the web.
08. Blocking an AI crawler in robots.txt does not affect your normal Google or Bing search ranking.
09. For real enforcement you need server- or CDN-level rules; robots.txt only handles the polite crawlers.
10. This builder runs entirely in your browser, and your choices are never uploaded.

Frequently Asked Questions

  • Does robots.txt actually stop AI crawlers? It stops the well-behaved ones. robots.txt is an advisory standard that the major AI companies document that they respect, so blocking GPTBot or ClaudeBot does keep them out. But it is not technically enforced, and aggressive or malicious scrapers can ignore it. For enforcement that does not depend on the crawler's cooperation, add server- or CDN-level user-agent and IP rules.
  • Where do I put the file? At the root of your domain so it resolves at https://yoursite.com/robots.txt. If you already have one, merge these User-agent blocks into the existing file rather than replacing it, so you keep your current rules.
  • Will blocking AI crawlers hurt my search ranking? No. The AI training crawlers are separate from the search crawlers. Blocking GPTBot, ClaudeBot or Google-Extended does not affect Googlebot or Bingbot, so your normal search indexing and ranking are unchanged. Keep "Other crawlers" set to Allow to stay searchable.
  • What is the difference between GPTBot and ChatGPT-User? GPTBot crawls the web to gather data that can improve OpenAI's models. ChatGPT-User fetches a specific page in real time because a user asked ChatGPT about it. Many sites block GPTBot (training) but allow ChatGPT-User, since the latter can send you referral visitors.
  • What are Google-Extended and Applebot-Extended? They are opt-out tokens, not separate bots. Google and Apple still crawl your site with Googlebot and Applebot for search. The "-Extended" token only controls whether that content may be used to train their AI (Gemini, Apple Intelligence). Disallowing it is a training opt-out, not a crawl block.
  • Should I block CCBot? CCBot belongs to Common Crawl, an open dataset that many AI models train on indirectly. If your goal is to keep your content out of model training broadly, blocking CCBot closes one of the largest upstream sources. If you want maximum reach and don't mind training use, you can leave it allowed.
  • Which crawlers does the builder cover? It covers the major, actively documented AI user-agents as of mid-2026: OpenAI, Anthropic, Google, Apple, Perplexity, Common Crawl, Meta, Amazon, ByteDance and Webz.io. New crawlers appear over time, so check each provider's official documentation periodically and add any new tokens to your file.
  • Can I block training but still let assistants fetch my pages? Yes, and that is exactly why the crawlers are named separately. Block the training agents (GPTBot, ClaudeBot, CCBot) but leave the user-triggered ones (ChatGPT-User, Claude-User, Perplexity-User) allowed. Then a person can still fetch and discuss your page in an assistant.
  • Are my choices uploaded or stored? No. The file is assembled entirely in your browser with plain JavaScript. Your toggle choices are never sent to any server or third party, and nothing is saved.
  • Is the builder free? Completely free, with no account or sign-up, and no limit on use. It runs in your browser and collects no data.
