What is robots.txt?
robots.txt is a plain-text file at a site's root that tells web crawlers which URLs they may or may not request.
robots.txt is a plain-text file placed at the root of a domain (e.g., example.com/robots.txt) that tells web crawlers which paths they may and may not request. It follows the Robots Exclusion Protocol, originally proposed by Martijn Koster in 1994 and published as RFC 9309, a Proposed Standard, in September 2022. Every major search engine (Google, Bing, Yandex) checks this file before crawling. The file uses simple directives like User-agent, Disallow, Allow, and Sitemap. It's important to understand that robots.txt is a polite request, not an access control mechanism; well-behaved bots honor it, but malicious scrapers ignore it. We use robots.txt on virtually every project we ship to manage crawl budget, keep crawlers out of staging paths, and point crawlers to our XML sitemap.
How it works
When a crawler (like Googlebot) visits a domain for the first time, it requests /robots.txt before anything else. The file is cached (Google generally caches it for up to 24 hours), and its directives apply to all subsequent requests in that window.
Here's a typical file:
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /staging/
Allow: /api/public/
User-agent: GPTBot
Disallow: /
Sitemap: https://example.com/sitemap.xml
Key rules:
- User-agent specifies which crawler the following directives apply to. * means all bots.
- Disallow blocks a path. An empty Disallow: line means "allow everything."
- Allow overrides a broader Disallow for a specific sub-path. Google supports this; not all crawlers do.
- Sitemap tells crawlers where your XML sitemap lives. You can list multiple.
- Lines starting with # are comments.
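To make the directive rules concrete, here is a minimal Python sketch that reads a robots.txt body into per-crawler rule groups. The names parse_robots and SAMPLE are illustrative, the comment line in SAMPLE was added for the demo, and the parser is a simplification under those assumptions, not how any particular crawler implements it.

```python
SAMPLE = """\
# comment line added for illustration
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /staging/
Allow: /api/public/
User-agent: GPTBot
Disallow: /
Sitemap: https://example.com/sitemap.xml
"""

def parse_robots(text):
    """Return (groups, sitemaps).

    groups maps a lowercased user-agent token to a list of
    (directive, path) tuples in file order.
    """
    groups = {}
    sitemaps = []
    current_agents = []    # agents named by the group being read
    reading_rules = False  # True once the current group has a rule
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if reading_rules:
                # A User-agent line that follows rules starts a new group.
                current_agents, reading_rules = [], False
            agent = value.lower()
            current_agents.append(agent)
            groups.setdefault(agent, [])
        elif field in ("allow", "disallow"):
            reading_rules = True
            for agent in current_agents:
                groups[agent].append((field, value))
        elif field == "sitemap":
            # Sitemap lines are not tied to any user-agent group.
            sitemaps.append(value)
    return groups, sitemaps

groups, sitemaps = parse_robots(SAMPLE)
```

For the sample file, groups["gptbot"] holds the single rule ("disallow", "/"), and sitemaps holds the one sitemap URL.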
Matching is path-prefix based. Disallow: /admin/ blocks /admin/login, /admin/settings, etc. Google also supports * (wildcard) and $ (end-of-URL anchor) in paths. These weren't part of the original 1994 proposal, but RFC 9309 now defines both.
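The matching behavior can be sketched in a few lines of Python. This is an illustrative approximation, assuming the longest-match precedence described in RFC 9309 (the most specific rule wins, and a tie goes to Allow). It skips percent-encoding normalization, and pattern_to_regex and is_allowed are hypothetical helper names.

```python
import re

def pattern_to_regex(pattern):
    """Compile a robots.txt path pattern; re.match anchors it at the path start."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # Escape literal text, then let each '*' match any run of characters.
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(regex + (r"\Z" if anchored else ""))

def is_allowed(rules, path):
    """rules is a list of (directive, pattern) tuples for one crawler group."""
    best = ("allow", "")  # no matching rule means the path is allowed
    for directive, pattern in rules:
        if not pattern:   # an empty Disallow: matches nothing
            continue
        if pattern_to_regex(pattern).match(path):
            longer = len(pattern) > len(best[1])
            tie_for_allow = len(pattern) == len(best[1]) and directive == "allow"
            if longer or tie_for_allow:
                best = (directive, pattern)
    return best[0] == "allow"

rules = [("disallow", "/admin/"), ("disallow", "/api/"), ("allow", "/api/public/")]
print(is_allowed(rules, "/admin/login"))      # False: prefix match on /admin/
print(is_allowed(rules, "/api/public/docs"))  # True: the longer Allow rule wins
```

A pattern such as /*.pdf$ would, under this sketch, block /files/report.pdf but not /files/report.pdf?download=1, because $ anchors the match to the end of the URL.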
The file must be UTF-8 encoded, served with a 200 status code, and located at the exact root path. A 404 on robots.txt means the crawler assumes everything is allowed. A 5xx means Google treats the site as fully disallowed for up to 30 days.
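As a rough illustration, the status-code behavior described above can be written as a small decision function. This sketch covers only the cases the paragraph names (200, 404, 5xx); any other status returns "unspecified" instead of guessing. The function name and return labels are hypothetical.

```python
def robots_fetch_outcome(status_code):
    """Map the HTTP status of a /robots.txt fetch to a crawl policy label."""
    if status_code == 200:
        return "parse_rules"    # read the file and apply its directives
    if status_code == 404:
        return "allow_all"      # no file: everything is treated as crawlable
    if status_code // 100 == 5:
        return "disallow_all"   # server error: Google treats the site as blocked
    return "unspecified"        # not covered by the description above

for code in (200, 404, 503):
    print(code, robots_fetch_outcome(code))
```

The practical takeaway is that a broken robots.txt endpoint that returns 5xx can stall crawling of the entire site, which is arguably worse than having no file at all.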
When to use it
robots.txt is about crawl management, not content removal. Use it to shape how bots spend their time on your site.
Use robots.txt when:
- You want to prevent crawling of admin panels, internal search result pages, or API endpoints
- You need to preserve crawl budget on large sites (100k+ pages) by blocking low-value faceted navigation
- You want to block specific AI training bots like GPTBot or CCBot from scraping your content
- You need to point crawlers to your sitemap(s)
Don't use robots.txt when:
- You want to remove a page from Google's index: use a noindex meta tag or X-Robots-Tag header instead. Disallowing a URL actually prevents Google from seeing the noindex directive, which can keep the page indexed longer.
- You're trying to hide sensitive content: robots.txt is publicly readable; anyone can visit /robots.txt
- You need authentication-level access control: use proper auth, not crawler directives
robots.txt vs alternatives
| Mechanism | Scope | Effect | Enforcement |
|---|---|---|---|
| robots.txt | Crawl-level | Prevents crawling of paths | Advisory (bots can ignore) |
| noindex meta tag | Page-level | Removes page from index | Enforced once crawled |
| X-Robots-Tag header | Response-level | Same as noindex but for non-HTML (PDFs, images) | Enforced once crawled |
| Sitemap | Site-level | Suggests pages TO crawl | Advisory |
| HTTP 401/403 | Server-level | Blocks access entirely | Enforced by server |
The most common mistake we see: using Disallow to try to deindex pages. That's backwards. If a page is already indexed and you block it in robots.txt, Google can't recrawl it to discover a noindex tag. The page may stay in search results indefinitely, just with a degraded snippet. Use noindex for deindexing, robots.txt for crawl budget.
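One way to catch this mistake before it ships is a quick audit for pages that carry noindex but sit under a disallowed path. The sketch below is a hypothetical check using plain prefix matching. The pages mapping and the prefixes are made-up sample data, not output from a real crawl.

```python
def hidden_noindex(pages, disallow_prefixes):
    """Return paths whose noindex can never be seen because crawling is blocked.

    pages maps a URL path to True when that page carries a noindex directive.
    """
    return sorted(
        path
        for path, has_noindex in pages.items()
        if has_noindex and any(path.startswith(prefix) for prefix in disallow_prefixes)
    )

# Hypothetical sample data for illustration only.
pages = {"/old-promo/": True, "/staging/home": True, "/blog/post": False}
print(hidden_noindex(pages, ["/staging/", "/admin/"]))  # ['/staging/home']
```

Any path this returns needs its Disallow lifted, at least until the page has been recrawled and dropped from the index.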
Real-world example
We worked on a Next.js e-commerce site with 300k+ product pages and heavy faceted filtering. The filtered URLs (/shoes?color=red&size=10) were generating millions of crawlable permutations, and Google was spending most of its crawl budget on those junk URLs instead of actual product pages.
We added this to robots.txt:
User-agent: *
Disallow: /*?color=
Disallow: /*?size=
Disallow: /*?sort=
Within two weeks, Google Search Console showed crawl requests to product pages increased by 4x, and 12,000 previously undiscovered product pages entered the index. The site saw a 23% increase in organic impressions over the following month. We also added a GPTBot disallow after the client decided they didn't want content used for LLM training.
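Before deploying wildcard rules like these, it can help to test them against sample URLs. The sketch below approximates Google-style wildcard matching in Python. The blocked helper and the test URLs (other than /shoes?color=red&size=10) are illustrative assumptions, not part of the original project.

```python
import re

RULES = ["/*?color=", "/*?size=", "/*?sort="]

def blocked(url_path, patterns=RULES):
    """Return True if any Disallow pattern matches from the start of the path."""
    for pattern in patterns:
        # Escape literal text, then let each '*' match any run of characters.
        regex = ".*".join(re.escape(part) for part in pattern.split("*"))
        if re.match(regex, url_path):
            return True
    return False

print(blocked("/shoes?color=red&size=10"))     # True
print(blocked("/shoes/air-max-90"))            # False
print(blocked("/shoes?brand=acme&color=red"))  # False
```

The last case is worth noting: each pattern includes the literal ?, so it only matches when the filter is the first query parameter. If filter parameters can appear in any order, or alongside parameters not listed here, some permutations would still be crawlable under these rules as written.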