Robots.txt Generator
Build a valid robots.txt with allow/disallow rules per bot, sitemap pointers, and ready-made presets for common stacks.
How to use the robots.txt generator
Start with a preset (optional)
If your site runs on WordPress or Laravel, or you want to block all AI training bots, click the matching preset. The rule groups, sitemap field, and allow/disallow lists fill in for you. From there, tweak by hand.
Add rules per user-agent
Each user-agent group starts with one bot (or * for all). Type a path under Allow or Disallow and press Enter to add it. Rule paths are matched case-sensitively, so write /Admin/ if that's how your folder is named.
Add your sitemap URL
Paste the full sitemap URL (https included) in the Sitemap field. You can add multiple sitemaps; each gets its own Sitemap: line. By convention Sitemap lines go at the bottom of the file, and they apply globally regardless of user-agent groups.
Copy or download
Click Copy to grab the text, or Download to save as robots.txt. Upload to the root of your domain — it must live at example.com/robots.txt, never in a subfolder.
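For readers who would rather script these steps, here is a minimal Python sketch of the same idea. It is not the generator's own code: the Group class, the build_robots_txt function, and the WordPress-style rules are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """One user-agent group: a bot name plus its Allow/Disallow paths (hypothetical helper)."""
    user_agent: str
    allow: list[str] = field(default_factory=list)
    disallow: list[str] = field(default_factory=list)


def build_robots_txt(groups: list[Group], sitemaps: list[str]) -> str:
    """Render groups and sitemap URLs as robots.txt text."""
    lines: list[str] = []
    for group in groups:
        lines.append(f"User-agent: {group.user_agent}")
        lines.extend(f"Allow: {path}" for path in group.allow)
        lines.extend(f"Disallow: {path}" for path in group.disallow)
        lines.append("")  # blank line separates groups
    # Sitemap lines are independent of any group; each URL gets its own line.
    lines.extend(f"Sitemap: {url}" for url in sitemaps)
    return "\n".join(lines).rstrip("\n") + "\n"


# Assumed example rules, not necessarily what the WordPress preset emits.
wordpress_like = build_robots_txt(
    groups=[Group("*", allow=["/wp-admin/admin-ajax.php"], disallow=["/wp-admin/"])],
    sitemaps=["https://example.com/sitemap.xml"],
)
print(wordpress_like)
```

Save the returned text as robots.txt and upload it to the domain root, exactly as you would with the downloaded file.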
Robots.txt — the polite handshake between you and the crawlers
The robots.txt file is one of the oldest conventions on the web still in everyday use. It was drafted in 1994 by Martijn Koster after a runaway crawler accidentally took down his employer's server, and the format has barely changed in three decades. The fundamentals are simple: a plain-text file at the root of your domain, listing which user-agents are allowed (or not allowed) into which folders. Every well-behaved crawler (Googlebot, Bingbot, GPTBot, ClaudeBot) fetches /robots.txt before it touches anything else and obeys what it finds. Misbehaved scrapers ignore it, but those are a different problem solved at the firewall layer.
What robots.txt is not
Robots.txt is a request, not a wall. It does not block access — it asks crawlers to skip certain paths. A page disallowed in robots.txt can still appear in Google search results (as a URL-only entry without snippet) if other sites link to it. To genuinely hide content, you need authentication, a noindex meta tag, or HTTP-level blocking. Use robots.txt to manage crawl budget (don't waste Googlebot's time on /admin) and to express preferences (please don't train your model on my content), not as a security boundary.
The most common robots.txt mistake is blocking /wp-admin/ on a WordPress site without also allowing /wp-admin/admin-ajax.php, which breaks half the dynamic features that depend on AJAX calls.
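The /wp-admin/ pair works because crawlers that follow RFC 9309 apply the longest matching rule, with Allow winning a tie. The Python sketch below illustrates that evaluation for a single group. It is deliberately simplified (plain path prefixes only, no * or $ wildcards), and is_allowed is a hypothetical helper, not any crawler's actual implementation.

```python
def is_allowed(path: str, allow: list[str], disallow: list[str]) -> bool:
    """Longest-prefix-match evaluation for one user-agent group.

    Simplified sketch: plain prefixes only, no * or $ wildcards.
    """
    best_len = -1
    allowed = True  # no matching rule means the path may be crawled
    # Allow rules are scanned first, so a Disallow of equal length cannot displace them.
    for rule_allows, prefixes in ((True, allow), (False, disallow)):
        for prefix in prefixes:
            if not prefix or not path.startswith(prefix):
                continue  # empty rules and non-matching prefixes are skipped
            if len(prefix) > best_len:
                best_len = len(prefix)
                allowed = rule_allows
    return allowed


allow_rules = ["/wp-admin/admin-ajax.php"]
disallow_rules = ["/wp-admin/"]

print(is_allowed("/wp-admin/admin-ajax.php", allow_rules, disallow_rules))
print(is_allowed("/wp-admin/options.php", allow_rules, disallow_rules))
print(is_allowed("/wp-admin/admin-ajax.php", [], disallow_rules))
```

The third call shows the mistake itself: with no Allow line, admin-ajax.php falls under the /wp-admin/ block along with everything else in that folder.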
AI crawlers — the new arrivals
In the past two years, the robots.txt landscape has gained a new class of visitor: AI training crawlers. GPTBot (OpenAI), ClaudeBot (Anthropic), Google-Extended (Google's Bard/Gemini opt-out token), and CCBot (Common Crawl, which feeds many open-source models) all respect robots.txt. If you want your content excluded from future model training, the Block AI bots preset adds the right Disallow lines. This is now a meaningful editorial decision — major publishers (Reuters, BBC, NYT) have made the call, and so should you.
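As a rough illustration of what blocking these four tokens looks like, the Python sketch below builds one Disallow group per token and checks the result with the standard library's urllib.robotparser. The exact text the preset emits may differ; the AI_TOKENS list and the example.com URL are assumptions for the example.

```python
from urllib.robotparser import RobotFileParser

# The four tokens named above; extend the list if you block other bots.
AI_TOKENS = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot"]

# One group per token, each disallowing the whole site.
ai_block = "\n\n".join(f"User-agent: {token}\nDisallow: /" for token in AI_TOKENS) + "\n"

parser = RobotFileParser()
parser.parse(ai_block.splitlines())

# Every AI token should be refused; a search crawler with no group of its own is not.
blocked = {
    token: not parser.can_fetch(token, "https://example.com/article")
    for token in AI_TOKENS
}
search_ok = parser.can_fetch("Googlebot", "https://example.com/article")

print(ai_block)
print(blocked, search_ok)
```

Because there is no User-agent: * group in this fragment, crawlers that are not named stay unrestricted, which is how search inclusion is preserved.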
The APAC search-engine landscape
Across APAC, Google dominates organic search in most countries (Singapore, Malaysia, Indonesia, Philippines, Thailand, Vietnam, India, Australia, Taiwan, Hong Kong), but several markets have heavyweight local players you should handle explicitly. China is Baidu's home turf (60%+ market share), so Baiduspider is non-negotiable for Chinese-language SEO; Sogou and 360Spider are smaller but worth allowing. In South Korea, Naver holds around 55% of search and crawls with the Yeti agent, which replaced the older NaverBot. Japan is split between Yahoo Japan (powered by Google) and Google direct, plus Slurp for Yahoo's legacy index. Russia's Yandex (YandexBot) matters if you have Russian-speaking visitors anywhere in APAC. Yandex was long the only major engine to honour the Host: directive for declaring a canonical host inside robots.txt, but it announced in 2018 that it no longer uses the directive; a 301 redirect between your www and non-www variants is the dependable way to signal the canonical host.
Crawl-delay — sparingly
The Crawl-delay directive asks crawlers to wait N seconds between requests. Bingbot, YandexBot, and most secondary crawlers respect it. Googlebot does not: Google adjusts its crawl rate automatically, and the manual crawl-rate limiter that Search Console used to offer was retired in early 2024. Set a Crawl-delay only if a specific bot is overloading your server; the default behaviour of major search engines is usually fine for shared hosting and definitely fine for a VPS or anything bigger.
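A small Python sketch of a per-bot Crawl-delay, read back with the standard library's urllib.robotparser (its crawl_delay method is available from Python 3.6). The 10-second value and the /admin/ path are assumptions for illustration. Note that a crawler with its own named group uses that group instead of the * group, which is why the Disallow line is repeated.

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: bingbot
Crawl-delay: 10
Disallow: /admin/

User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Only the bot that was causing load gets a delay; everyone else has none.
bing_delay = parser.crawl_delay("bingbot")
default_delay = parser.crawl_delay("SomeOtherBot")  # hypothetical bot name
bing_admin_ok = parser.can_fetch("bingbot", "https://example.com/admin/")

print(bing_delay, default_delay, bing_admin_ok)
```

Scoping the delay to one named bot keeps every other crawler at its default pace, in line with the advice to use the directive sparingly.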
Validating after you deploy
After uploading your robots.txt, check it in Google Search Console and Bing Webmaster Tools. Search Console's robots.txt report (which replaced the standalone robots.txt Tester in late 2023) shows the file Google last fetched and any parsing problems, and its URL Inspection tool tells you whether a specific URL is blocked; Bing's robots.txt Tester checks individual URLs against your rules. Re-test every time you change a Disallow line: accidentally blocking / for Googlebot can strip your pages from search results within days, and recovery can take weeks. Always keep the previous robots.txt in version control so you can roll back in seconds.
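Alongside the webmaster tools, a quick local smoke test can catch an accidental site-wide block before the file goes live. The Python sketch below uses the standard library's urllib.robotparser; blocked_urls and the URL list are hypothetical. One caveat: Python's parser applies rules in file order rather than by longest match, so for files that mix overlapping Allow and Disallow lines its verdict can differ from Google's. Treat it as a sanity check, not a substitute for the official tools.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt: str, user_agent: str, urls: list[str]) -> list[str]:
    """Return the URLs that user_agent may not fetch under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Pages that must never be blocked (assumed examples).
must_stay_crawlable = ["https://example.com/", "https://example.com/blog/post-1"]

good = "User-agent: *\nDisallow: /admin/\n"
bad = "User-agent: *\nDisallow: /\n"  # dropping the path segment blocks the whole site

print(blocked_urls(good, "Googlebot", must_stay_crawlable))
print(blocked_urls(bad, "Googlebot", must_stay_crawlable))
```

Run in a pre-deploy script, a non-empty result is the signal to stop and roll back to the previous version.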
10 Things You Didn't Know About Robots.txt
Robots.txt was created in 1994 by Martijn Koster after a web crawler accidentally crashed his employer's server with too many requests.
The original Robots Exclusion Protocol was never a formal standard. Google submitted it to the IETF as a draft in 2019, and it was published as RFC 9309 in 2022, almost 30 years later.
Google enforces a 500 KiB size limit on robots.txt. Anything past that limit is silently ignored.
The Crawl-delay directive is honoured by Bing and Yandex but explicitly ignored by Google, which manages Googlebot's crawl rate itself.
NASA's robots.txt blocks Bing entirely from /audience/foreducators/ for reasons NASA has never publicly explained.
Baidu's Baiduspider ignores robots.txt for the first crawl of a brand-new domain — it has to know what exists before it can be told what to skip.
Google added Google-Extended in September 2023 specifically as a robots.txt token publishers could use to opt out of Bard/Gemini training while keeping search inclusion.
When an Allow rule and a Disallow rule both match a URL, the more specific one (the longer path) wins. If the two are equally specific, RFC 9309 recommends that the Allow rule be used.
The robots.txt must live at the protocol + host + port root — https://example.com:8080/robots.txt is independent from https://example.com/robots.txt.
Yandex was long the only major engine to honour the Host: directive for declaring a canonical hostname, though it announced in 2018 that it had stopped using it.
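The scoping rule in the list above (one robots.txt per protocol, host, and port) can be expressed in a few lines of Python. This is an illustrative sketch using the standard library's urllib.parse; robots_url is a hypothetical helper name.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url (same scheme, host, and port)."""
    parts = urlsplit(page_url)
    # Keep scheme and netloc (host plus any port); drop path, query, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://example.com/shop/item?id=1"))
print(robots_url("https://example.com:8080/shop/item?id=1"))
print(robots_url("https://blog.example.com/post"))
```

The three calls resolve to three different files, which is why a subdomain or a service on a non-default port needs its own robots.txt.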
FAQ
- Robots.txt belongs at the root of your domain: https://example.com/robots.txt. Crawlers will not check any other location. Subdomains need their own file (e.g. blog.example.com/robots.txt).
- Robots.txt does not block access; it is a polite request. Well-behaved crawlers (Google, Bing, OpenAI's GPTBot, Anthropic's ClaudeBot) obey it. Malicious scrapers ignore it. To genuinely block access you need server-level auth, IP blocking, or a Web Application Firewall.
- A page disallowed in robots.txt can still be listed in search results: Google may index a URL it cannot crawl, showing it as a URL-only entry. To prevent listing entirely, allow the page in robots.txt but add <meta name="robots" content="noindex"> in the HTML head.
- To block AI training crawlers, add Disallow rules for GPTBot (OpenAI), ClaudeBot (Anthropic), Google-Extended (Bard/Gemini), and CCBot (Common Crawl). The Block AI bots preset above applies all four at once. Many sites now add PerplexityBot as well.
- When an Allow and a Disallow rule both match a URL, the more specific (longer) rule wins. So Allow: /folder/page.html overrides Disallow: /folder/. This is how WordPress users keep /wp-admin/admin-ajax.php open while blocking the rest of /wp-admin/.
- Rule paths are case-sensitive, and so are URLs on Linux/Unix servers (which is most of the web). Disallow: /admin won't block requests to /Admin. Always match the actual casing of your URLs.
- A separate Googlebot-Image group is needed only if you want to allow text crawling but block image indexing (or vice versa). Rules under User-agent: * apply to Googlebot-Image too unless you create a specific group for it, which then replaces them.
- Google ignores Crawl-delay entirely and manages Googlebot's crawl rate itself; the manual crawl-rate setting that Search Console used to offer was retired in early 2024. Bing, Yandex, and most other engines do honour Crawl-delay.
- List multiple Sitemap: lines if your site has multiple sitemap files, which is common for large sites split by content type (sitemap-posts.xml, sitemap-products.xml). Or use a sitemap index file and point to that single URL. Either approach is valid.
- To test the file, use Google Search Console's robots.txt report and URL Inspection tool (the standalone robots.txt Tester was retired in late 2023) and the robots.txt Tester in Bing Webmaster Tools. Between them you can check whether a specific URL would be allowed or blocked under your rules. Always test after every change.