What is robots.txt?

robots.txt is a plain-text file placed at the root of a domain (e.g., example.com/robots.txt) that tells web crawlers which paths they may or may not request. It follows the Robots Exclusion Protocol, originally proposed by Martijn Koster in 1994 and published by the IETF as RFC 9309, a Proposed Standard, in September 2022. Every major search engine, including Google, Bing, and Yandex, checks this file before crawling. The file uses simple directives like User-agent, Disallow, Allow, and Sitemap. It's important to understand that robots.txt is a polite request, not an access control mechanism; well-behaved bots honor it, but malicious scrapers ignore it. We use robots.txt on virtually every project we ship to manage crawl budget, keep crawlers out of staging paths, and point crawlers to our XML sitemap.

How it works

When a crawler (like Googlebot) visits a domain for the first time, it requests /robots.txt before anything else. The file is cached (Google generally caches it for up to 24 hours), and its directives apply to all subsequent requests in that window.
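
The fetch-then-cache behavior can be sketched from the crawler's side. This is a minimal illustration, not how any real crawler is implemented: the `RobotsCache` class, the injected `fetch` callable, and the 24-hour default are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, Tuple


class RobotsCache:
    """Caches robots.txt bodies per host for a fixed time-to-live (hypothetical helper)."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float = 24 * 3600,  # assumed default, mirroring the ~24h window
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, host: str) -> str:
        """Return the cached body for host, refetching once the entry has expired."""
        now = self._clock()
        cached = self._entries.get(host)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        body = self._fetch(host)
        self._entries[host] = (now, body)
        return body
```

The practical consequence for site owners is that a change to robots.txt may not take effect until the crawler's cached copy expires.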

Here's a typical file:

User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /staging/
Allow: /api/public/

User-agent: GPTBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
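
To make the structure of that file concrete, here is a rough sketch of how a parser might split it into per-crawler groups plus a list of sitemaps. The `parse_robots` function is a hypothetical helper written for this article; it ignores edge cases that production parsers handle, such as unknown directives, byte-order marks, and size limits.

```python
from typing import Dict, List, Tuple

Rule = Tuple[str, str]  # ("allow" | "disallow", path pattern)


def parse_robots(text: str) -> Tuple[Dict[str, List[Rule]], List[str]]:
    """Return ({user_agent: [rules]}, [sitemap_urls]) for a robots.txt body."""
    groups: Dict[str, List[Rule]] = {}
    sitemaps: List[str] = []
    current_agents: List[str] = []
    in_rules = False  # True once the current group has seen a rule line

    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                current_agents, in_rules = [], False
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            in_rules = True
            for agent in current_agents:
                groups[agent].append((field, value))
        elif field == "sitemap":
            sitemaps.append(value)  # sitemaps are global, not tied to a group
    return groups, sitemaps


EXAMPLE = """\
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /staging/
Allow: /api/public/

User-agent: GPTBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
"""

groups, sitemaps = parse_robots(EXAMPLE)
print(groups["gptbot"])  # [('disallow', '/')]
print(sitemaps)          # ['https://example.com/sitemap.xml']
```

Note that a crawler that finds a group for its own name uses only that group, not the `*` group as well. In the example above that makes no difference for GPTBot, since its group disallows the whole site.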

Key rules:

User-agent specifies which crawler the following directives apply to. * means all bots.
Disallow blocks a path. An empty Disallow: line means "allow everything."
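Allow overrides a broader Disallow for a specific sub-path. It wasn't in the 1994 proposal but is part of RFC 9309, where the longest matching rule wins. Google supports it; older or simpler crawlers may not.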
Sitemap tells crawlers where your XML sitemap lives. You can list multiple.
Lines starting with # are comments.

Matching is path-prefix based. Disallow: /admin/ blocks /admin/login, /admin/settings, etc. Google also supports * (wildcard) and $ (end-of-URL anchor) in paths. These weren't part of the original 1994 proposal, but RFC 9309 defines both.
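
The matching rules can be sketched in a few lines. The example below assumes the precedence described in RFC 9309: the longest matching pattern wins, and Allow wins a tie. The `pattern_to_regex` and `is_allowed` helpers are hypothetical names for this illustration, and the sketch skips details such as percent-encoding normalization.

```python
import re
from typing import Iterable, Tuple


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex anchored at the start."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # Escape every literal chunk, then join the chunks with '.*' for each '*'.
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))


def is_allowed(path: str, rules: Iterable[Tuple[str, str]]) -> bool:
    """rules holds ('allow' | 'disallow', pattern) pairs for one crawler group."""
    best_len, allowed = -1, True  # no matching rule means the path is allowed
    for kind, pattern in rules:
        if not pattern:  # an empty pattern matches nothing
            continue
        if pattern_to_regex(pattern).match(path):
            length = len(pattern)
            # Longest pattern wins; on a tie, prefer allow.
            if length > best_len or (length == best_len and kind == "allow"):
                best_len, allowed = length, kind == "allow"
    return allowed


RULES = [
    ("disallow", "/admin/"),
    ("disallow", "/api/"),
    ("allow", "/api/public/"),
    ("disallow", "/*.pdf$"),
]

print(is_allowed("/admin/login", RULES))        # False
print(is_allowed("/api/public/status", RULES))  # True
print(is_allowed("/docs/guide.pdf", RULES))     # False
print(is_allowed("/docs/guide.pdf?v=2", RULES)) # True
```

The last two lines show the effect of the `$` anchor: the pattern blocks URLs that end in .pdf but not ones with a query string appended.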

The file must be UTF-8 encoded, served with a 200 status code, and located at the exact root path. A 404 on robots.txt means the crawler assumes everything is allowed. A 5xx means Google treats the site as fully disallowed for up to 30 days.
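
The status-code behavior described above amounts to a small decision table. The sketch below is one way to express it; the `crawl_policy` function and its return labels are made up for illustration. Treating the whole 4xx range like a 404 is an assumption here: RFC 9309 permits it, but individual crawlers differ (Google, for example, documents an exception for 429).

```python
def crawl_policy(status_code: int) -> str:
    """Map the HTTP status of a /robots.txt fetch to a crawl policy label."""
    if 200 <= status_code < 300:
        return "parse"         # read the body and apply its rules
    if 400 <= status_code < 500:
        return "allow_all"     # e.g. 404: no file, so no restrictions (assumed for all 4xx)
    if 500 <= status_code < 600:
        return "disallow_all"  # server error: treat the site as off limits for now
    return "unknown"           # redirects and other codes are out of scope for this sketch


for code in (200, 404, 503):
    print(code, crawl_policy(code))
```

This is why a misconfigured server that returns 5xx for robots.txt can quietly stall crawling of an otherwise healthy site.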

When to use it

robots.txt is about crawl management, not content removal. Use it to shape how bots spend their time on your site.

Use robots.txt when:

You want to prevent crawling of admin panels, internal search result pages, or API endpoints
You need to preserve crawl budget on large sites (100k+ pages) by blocking low-value faceted navigation
You want to block specific AI training bots like GPTBot or CCBot from scraping your content
You need to point crawlers to your sitemap(s)

Don't use robots.txt when:

You want to remove a page from Google's index — use a noindex meta tag or X-Robots-Tag header instead. Disallowing a URL actually prevents Google from seeing the noindex directive, which can keep the page indexed longer.
You're trying to hide sensitive content — robots.txt is publicly readable; anyone can visit /robots.txt
You need authentication-level access control — use proper auth, not crawler directives
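
For the deindexing case, the X-Robots-Tag header is set by the server rather than in robots.txt. Below is a minimal sketch using a plain WSGI app; the `/internal/` path prefix and the `app` function are hypothetical, and a real site would normally set the header in its framework or web server configuration.

```python
from typing import Callable, Iterable, List, Tuple


def app(environ: dict, start_response: Callable) -> Iterable[bytes]:
    """Tiny WSGI app that marks every response under /internal/ as noindex."""
    path = environ.get("PATH_INFO", "/")
    headers: List[Tuple[str, str]] = [("Content-Type", "text/plain; charset=utf-8")]
    if path.startswith("/internal/"):  # hypothetical path prefix
        headers.append(("X-Robots-Tag", "noindex"))
    start_response("200 OK", headers)
    return [b"ok\n"]
```

To try it locally, the standard library's `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` will serve it. For the header to have any effect, the path must stay crawlable, so it should not also be disallowed in robots.txt.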

robots.txt vs alternatives

Mechanism	Scope	Effect	Enforcement
robots.txt	Crawl-level	Prevents crawling of paths	Advisory (bots can ignore)
noindex meta tag	Page-level	Removes page from index	Enforced once crawled
X-Robots-Tag header	Response-level	Same as noindex, and also works for non-HTML files (PDFs, images)	Enforced once crawled
Sitemap	Site-level	Suggests pages TO crawl	Advisory
HTTP 401/403	Server-level	Blocks access entirely	Enforced by server

The most common mistake we see: using Disallow to try to deindex pages. That's backwards. If a page is already indexed and you block it in robots.txt, Google can't recrawl it to discover a noindex tag. The page may stay in search results indefinitely, just with a degraded snippet. Use noindex for deindexing, robots.txt for crawl budget.

Real-world example

We worked on a Next.js e-commerce site with 300k+ product pages and heavy faceted filtering. The filtered URLs (/shoes?color=red&size=10) were generating millions of crawlable permutations, and Google was spending most of its crawl budget on those junk URLs instead of actual product pages.

We added this to robots.txt:

User-agent: *
Disallow: /*?color=
Disallow: /*?size=
Disallow: /*?sort=

Within two weeks, Google Search Console showed that crawl requests to product pages had increased fourfold and that 12,000 previously undiscovered product pages had entered the index. The site saw a 23% increase in organic impressions over the following month. We also added a GPTBot disallow after the client decided they didn't want content used for LLM training.
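
One subtlety of patterns like these is worth checking before reusing them: `/*?color=` only matches when color is the first query parameter, because the pattern requires a literal `?` immediately before the name. The sketch below demonstrates this with a simplified wildcard matcher; the `blocked` helper is hypothetical and handles only `*`, not the `$` anchor.

```python
import re
from typing import List


def blocked(url_path: str, patterns: List[str]) -> bool:
    """True if any Disallow pattern matches, treating '*' as a wildcard."""
    for pattern in patterns:
        regex = ".*".join(re.escape(part) for part in pattern.split("*"))
        if re.match(regex, url_path):
            return True
    return False


PATTERNS = ["/*?color=", "/*?size=", "/*?sort="]

print(blocked("/shoes?color=red&size=10", PATTERNS))  # True
print(blocked("/shoes?page=2&color=red", PATTERNS))   # False: color is not first
print(blocked("/shoes/air-runner", PATTERNS))         # False
```

If filter parameters can appear in any position, one option is to add matching `/*&color=` style rules alongside the `?` versions. Whether that is needed depends on how the site builds its URLs.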

Frequently asked questions about robots.txt

Is robots.txt the same as noindex?

No, they do different things. robots.txt controls whether a crawler can *request* a URL at all. A noindex directive (either as a meta tag or X-Robots-Tag HTTP header) tells a crawler that's already accessed the page not to add it to the search index. In fact, they can conflict: if you disallow a URL in robots.txt, Google can't crawl it, which means it can never see a noindex tag on that page. If the URL was previously indexed, it may remain in search results. For deindexing, always use noindex, not robots.txt.

When did robots.txt become a standard?

The Robots Exclusion Protocol was first proposed by Martijn Koster in 1994 on the www-talk mailing list (the original consensus document is dated June 30, 1994), and it became a de facto convention almost immediately. For more than 28 years it operated as an informal standard. In September 2022, the IETF published RFC 9309, which formally specified the protocol as a Proposed Standard. Google, which had extended the spec with wildcard and anchor pattern matching, contributed heavily to that RFC. So while robots.txt has been widely supported since the mid-90s, it only became a formal IETF standard in 2022.

What's the alternative to robots.txt?

It depends on what you're trying to accomplish. For preventing indexing, use a `noindex` meta tag or the `X-Robots-Tag` HTTP header. For blocking access entirely, use server-side authentication (401/403 responses). For guiding crawlers to your important content, use an XML sitemap. For controlling how your snippet appears in search, use `max-snippet` and `max-image-preview` meta robots directives. robots.txt is specifically for managing crawl behavior at scale, and for that particular job there isn't a direct replacement; it's the standard mechanism.

Can robots.txt block AI crawlers like GPTBot?

Yes. OpenAI's GPTBot, Anthropic's ClaudeBot (previously anthropic-ai), and Common Crawl's CCBot all respect robots.txt directives, and Google honors the Google-Extended token, which is a robots.txt control token rather than a separate crawler. You can add `User-agent: GPTBot` followed by `Disallow: /` to block OpenAI's crawler from your entire site. As of early 2026, most major AI companies honor robots.txt for their training data crawlers. However, this is voluntary compliance; there's no technical enforcement. We've been adding AI bot blocks to robots.txt on client sites since mid-2023, and it's become a standard part of our deployment checklist.
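
If you maintain these blocks across several sites, generating them from a list keeps the files consistent. The following is a small sketch; the `ai_block_rules` helper is hypothetical, and the token list simply repeats the names mentioned above, so check each vendor's current documentation before relying on it.

```python
from typing import Iterable


def ai_block_rules(user_agents: Iterable[str]) -> str:
    """Build one robots.txt group per user agent, each disallowing the whole site."""
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in user_agents]
    return "\n\n".join(groups) + "\n"


# Tokens taken from the answer above; treat the list as an assumption, not a registry.
AI_TOKENS = ["GPTBot", "Google-Extended", "ClaudeBot", "CCBot"]

print(ai_block_rules(AI_TOKENS))
```

The output can be appended to an existing robots.txt; each generated group is independent of the `User-agent: *` group.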