Robots.txt Generator

Build a valid robots.txt with allow/disallow rules per bot, sitemap pointers, and ready-made presets for common stacks.

    How to use the robots.txt generator

    Start with a preset (optional)

    If your site runs WordPress or Laravel, or you want to block all AI training bots, click the matching preset. The rule groups, sitemap field, and allow/disallow lists fill in for you. From there, tweak by hand.

    Add rules per user-agent

    Each user-agent group starts with one bot (or * for all). Type a path under Allow or Disallow and press Enter to add it. Crawlers match rule paths case-sensitively, so write /Admin/ if that is how the folder appears in your URLs.

    Add your sitemap URL

    Paste the full sitemap URL (https included) in the Sitemap field. You can add multiple sitemaps; each gets its own Sitemap: line. Sitemap lines are conventionally placed at the bottom, but they can appear anywhere in the file and apply globally regardless of user-agent groups.
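
    The steps above can be sketched in a few lines of Python. This is only an illustration of the file layout, not the generator's actual code: the Group class and build_robots_txt function are hypothetical names, and the paths and sitemap URL are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """One user-agent group (hypothetical helper for this sketch)."""
    agent: str
    allow: list[str] = field(default_factory=list)
    disallow: list[str] = field(default_factory=list)


def build_robots_txt(groups: list[Group], sitemaps: list[str]) -> str:
    """Render rule groups and sitemap URLs as robots.txt text."""
    blocks = []
    for group in groups:
        lines = [f"User-agent: {group.agent}"]
        # Allow lines go first so parsers that apply rules in file order honour them too.
        lines += [f"Allow: {path}" for path in group.allow]
        lines += [f"Disallow: {path}" for path in group.disallow]
        blocks.append("\n".join(lines))
    # Sitemap lines are independent of user-agent groups, so they get their own block.
    if sitemaps:
        blocks.append("\n".join(f"Sitemap: {url}" for url in sitemaps))
    return "\n\n".join(blocks) + "\n"


robots_txt = build_robots_txt(
    groups=[
        Group("*", allow=["/wp-admin/admin-ajax.php"], disallow=["/wp-admin/"]),
        Group("GPTBot", disallow=["/"]),
    ],
    sitemaps=["https://example.com/sitemap.xml"],
)
print(robots_txt)
```

    Groups are separated by a blank line, and each sitemap gets its own Sitemap: line at the end.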

    Copy or download

    Click Copy to grab the text, or Download to save as robots.txt. Upload to the root of your domain — it must live at example.com/robots.txt, never in a subfolder.

    Robots.txt — the polite handshake between you and the crawlers

    The robots.txt file is the oldest still-standardised contract on the web. It was drafted in 1994 by Martijn Koster after a runaway crawler accidentally took down his employer's server, and the format has barely changed in three decades. The fundamentals are simple: a plain-text file at the root of your domain, listing which user-agents are allowed (or not allowed) into which folders. Every well-behaved crawler — Googlebot, Bingbot, GPTBot, ClaudeBot — fetches /robots.txt before it touches anything else and obeys what it finds. Misbehaved scrapers ignore it, but those are a different problem solved at the firewall layer.

    What robots.txt is not

    Robots.txt is a request, not a wall. It does not block access — it asks crawlers to skip certain paths. A page disallowed in robots.txt can still appear in Google search results (as a URL-only entry without snippet) if other sites link to it. To genuinely hide content, you need authentication, a noindex meta tag, or HTTP-level blocking. Use robots.txt to manage crawl budget (don't waste Googlebot's time on /admin) and to express preferences (please don't train your model on my content), not as a security boundary.

    A common robots.txt mistake is blocking /wp-admin/ on a WordPress site without also allowing /wp-admin/admin-ajax.php, which keeps crawlers from rendering dynamic features that depend on AJAX calls.
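
    A quick way to confirm the WordPress pattern behaves as intended is Python's standard-library parser. In this sketch the user-agent name ExampleBot and the example.com URLs are placeholders. Note that urllib.robotparser applies rules in file order rather than by longest match, so the Allow line is listed first; that ordering gives the same result under both approaches.

```python
from urllib.robotparser import RobotFileParser

WORDPRESS_RULES = """\
User-agent: *
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/
"""

parser = RobotFileParser()
# parse() takes an iterable of lines, so no network request is made here.
parser.parse(WORDPRESS_RULES.splitlines())

ajax_ok = parser.can_fetch("ExampleBot", "https://example.com/wp-admin/admin-ajax.php")
admin_ok = parser.can_fetch("ExampleBot", "https://example.com/wp-admin/options.php")
print(ajax_ok, admin_ok)
```

    The AJAX endpoint stays crawlable while the rest of /wp-admin/ is disallowed.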

    AI crawlers — the new arrivals

    In the past two years, the robots.txt landscape has gained a new class of visitor: AI training crawlers. GPTBot (OpenAI), ClaudeBot (Anthropic), Google-Extended (Google's Bard/Gemini opt-out token), and CCBot (Common Crawl, which feeds many open-source models) all respect robots.txt. If you want your content excluded from future model training, the Block AI bots preset adds the right Disallow lines. This is now a meaningful editorial decision — major publishers (Reuters, BBC, NYT) have made the call, and so should you.
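
    What such a preset produces can be approximated as follows. This is a sketch, not the tool's implementation: block_bots is a hypothetical helper, and the list holds only the four tokens named above.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens named in the paragraph above.
AI_TRAINING_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot"]


def block_bots(agents: list[str]) -> str:
    """Return one 'Disallow: /' group per user-agent token."""
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in agents]
    return "\n\n".join(groups) + "\n"


ai_block = block_bots(AI_TRAINING_BOTS)
print(ai_block)

# Sanity check: the AI crawler is blocked, a search crawler is unaffected.
parser = RobotFileParser()
parser.parse(ai_block.splitlines())
gptbot_allowed = parser.can_fetch("GPTBot", "https://example.com/articles/")
googlebot_allowed = parser.can_fetch("Googlebot", "https://example.com/articles/")
```

    Because each bot gets its own group, search crawlers such as Googlebot keep whatever access the rest of the file grants them.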

    The APAC search-engine landscape

    Across APAC, Google dominates organic search in most countries (Singapore, Malaysia, Indonesia, Philippines, Thailand, Vietnam, India, Australia, Taiwan, Hong Kong), but several markets have heavyweight local players you should explicitly handle. China is Baidu's home turf (60%+ market share), so Baiduspider is non-negotiable for Chinese-language SEO; Sogou and 360Spider are smaller but worth allowing. In South Korea, Naver holds around 55% of search and crawls with Yeti, which replaced the older NaverBot agent. Japan is split between Yahoo Japan (powered by Google) and Google direct, plus Slurp for Yahoo's legacy index. Russia's Yandex (YandexBot) matters if you have Russian-speaking visitors anywhere in APAC. Yandex was the last major engine to honour the Host: directive for declaring a canonical host inside robots.txt, but it dropped support in 2018; use 301 redirects to consolidate www and non-www variants instead.

    Crawl-delay — sparingly

    The Crawl-delay directive asks crawlers to wait N seconds between requests. Bingbot, YandexBot, and most secondary crawlers respect it. Googlebot does not: Google adjusts its crawl rate automatically, and the manual crawl-rate limiter in Search Console was retired in early 2024. Set a Crawl-delay only if a specific bot is overloading your server; the default behaviour of major search engines is usually fine for shared hosting and definitely fine for a VPS or anything bigger.
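
    For illustration, here is how a crawler written in Python might read the directive using the standard library (crawl_delay is available from Python 3.6). The rules, the 10-second value, and the ExampleBot name are made up for this sketch.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: Bingbot
Crawl-delay: 10
Disallow: /search/

User-agent: *
Disallow: /search/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Seconds requested for Bingbot in its own group.
bing_delay = parser.crawl_delay("Bingbot")
# ExampleBot falls back to the * group, which sets no delay.
other_delay = parser.crawl_delay("ExampleBot")
print(bing_delay, other_delay)
```

    The delay applies only to the group it appears in, which is why targeting a single overloading bot is possible without slowing everyone else.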

    Validating after you deploy

    After uploading your robots.txt, check it in Google Search Console's robots.txt report (which replaced the older robots.txt Tester in late 2023) and in the robots.txt tester in Bing Webmaster Tools. Confirm that the file was fetched and parsed without errors, and use Search Console's URL Inspection tool to see whether specific URLs are allowed or blocked under your current rules. Re-test every time you change a Disallow line: accidentally blocking / for Googlebot can drop your pages from the index quickly, and recovery can take weeks. Always keep the previous robots.txt in version control so you can roll back in seconds.
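
    A local pre-deploy check can catch the worst mistakes before the file ever goes live. The sketch below is one possible approach, not a replacement for the search engines' own tools: blocked_urls is a hypothetical helper, the URLs are placeholders, and Python's parser applies rules in file order, so results may differ from Google's for files with overlapping Allow and Disallow rules.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt: str, agent: str, must_allow: list[str]) -> list[str]:
    """Return the URLs from must_allow that the rules would block for agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in must_allow if not parser.can_fetch(agent, url)]


# Pages that must always stay crawlable (placeholders for this sketch).
CRITICAL_URLS = [
    "https://example.com/",
    "https://example.com/products/widget",
]

good = "User-agent: *\nDisallow: /admin/\n"
bad = "User-agent: Googlebot\nDisallow: /\n"

good_result = blocked_urls(good, "Googlebot", CRITICAL_URLS)
bad_result = blocked_urls(bad, "Googlebot", CRITICAL_URLS)
print(good_result, bad_result)
```

    Run as part of a deploy script, a non-empty result is a signal to stop and review the change.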

    10 Things You Didn't Know About Robots.txt

    01

    Robots.txt was created in 1994 by Martijn Koster after a web crawler accidentally crashed his employer's server with too many requests.

    02

    The original Robots Exclusion Protocol was never an RFC. Google, working with Martijn Koster, submitted a draft to the IETF in 2019, and it was published as RFC 9309 in 2022, almost 30 years after the protocol was drafted.

    03

    Google processes only the first 500 KiB of a robots.txt file. Anything past that limit is ignored.

    04

    The Crawl-delay directive is honoured by Bing and Yandex but ignored by Google, which sets Googlebot's crawl rate automatically.

    05

    NASA's robots.txt blocks Bing entirely from /audience/foreducators/ for reasons NASA has never publicly explained.

    06

    Baidu's Baiduspider ignores robots.txt for the first crawl of a brand-new domain — it has to know what exists before it can be told what to skip.

    07

    Google added Google-Extended in September 2023 specifically as a robots.txt token publishers could use to opt out of Bard/Gemini training while keeping search inclusion.

    08

    Allow rules override Disallow rules when both match a URL, provided the Allow path is at least as specific (an equal or longer path prefix). On a tie, the less restrictive Allow rule wins.

    09

    The robots.txt must live at the protocol + host + port root — https://example.com:8080/robots.txt is independent from https://example.com/robots.txt.

    10

    Yandex was the last major engine to support the Host: directive for declaring a canonical hostname; it dropped support in 2018 in favour of 301 redirects.
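
    The precedence rule in item 08 can be sketched as a longest-match lookup, following the behaviour described in RFC 9309. The function name is_allowed and the (kind, prefix) rule format are assumptions made for this example, and the * and $ wildcards are deliberately left out.

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Resolve plain-prefix rules by longest match; Allow wins ties.

    rules holds ("allow" | "disallow", path_prefix) pairs for one group.
    Wildcards (* and $) are not handled in this sketch.
    """
    best_len = -1
    allowed = True  # no matching rule means the path is allowed
    for kind, prefix in rules:
        # An empty prefix matches nothing; skip rules that do not apply.
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = kind == "allow"
        # A longer prefix always wins; on equal length, Allow wins.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


RULES = [("disallow", "/wp-admin/"), ("allow", "/wp-admin/admin-ajax.php")]
print(is_allowed("/wp-admin/admin-ajax.php", RULES))
print(is_allowed("/wp-admin/options.php", RULES))
```

    Unlike a first-match parser, this gives the same answer regardless of the order in which the rules are written.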

    FAQ

    • At the root of your domain — https://example.com/robots.txt. Crawlers will not check any other location. Subdomains need their own file (e.g. blog.example.com/robots.txt).

    • No — it's a polite request. Well-behaved crawlers (Google, Bing, OpenAI's GPTBot, Anthropic's ClaudeBot) obey it. Malicious scrapers ignore it. To genuinely block access you need server-level auth, IP blocking, or a Web Application Firewall.

    • Yes — Google may index a URL it cannot crawl, showing it as a URL-only entry. To prevent listing entirely, allow the page in robots.txt but add <meta name="robots" content="noindex"> in the HTML head.

    • Add Disallow rules for GPTBot (OpenAI), ClaudeBot (Anthropic), Google-Extended (Bard/Gemini), and CCBot (Common Crawl). Use the Block AI bots preset above to apply all four at once. Many sites now add PerplexityBot as well.

    • When both match a URL, the more specific (longer) rule wins. So Allow: /folder/page.html overrides Disallow: /folder/. This is how WordPress users keep /wp-admin/admin-ajax.php open while blocking the rest of /wp-admin/.

    • Yes. Crawlers compare rule paths case-sensitively, so Disallow: /admin won't block requests to /Admin. Always match the exact casing used in your URLs.

    • Only if you want to allow text crawling but block image indexing (or vice versa). Rules under User-agent: * apply to Googlebot-Image too unless you create a specific group for it that overrides them.

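    • No. Google ignores Crawl-delay entirely and adjusts Googlebot's crawl rate automatically; the crawl-rate limiter that used to sit in Search Console settings was retired in early 2024. If Googlebot is overloading your server, temporarily returning 503 or 429 responses slows it down. Bing, Yandex, and most other engines do honour Crawl-delay.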

    • Yes if your site has multiple sitemap files — common for large sites split by content type (sitemap-posts.xml, sitemap-products.xml). Or use a sitemap index file and point to that single URL. Either approach is valid.

    • Check it in Google Search Console's robots.txt report (the successor to the retired robots.txt Tester) and in the robots.txt tester in Bing Webmaster Tools. Use Search Console's URL Inspection tool to see whether a specific URL is allowed or blocked under your rules. Always test after every change.
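
    The location rule from the first answer (one file per scheme, host, and port) can be expressed as a small helper. The name robots_url is hypothetical; the sketch uses only the standard library.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url."""
    parts = urlsplit(page_url)
    # netloc keeps the host and any explicit port, so each origin maps to its own file.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://example.com/blog/post?id=1"))
print(robots_url("https://blog.example.com/archive/"))
print(robots_url("https://example.com:8080/app"))
```

    A subdomain or a non-default port yields a different URL, which is why each needs its own robots.txt.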
