Glossary

What Is robots.txt?

robots.txt is a plain-text file at the root of a website (e.g. example.com/robots.txt) that tells web crawlers which URLs the site owner does or doesn't want them to access, following the Robots Exclusion Protocol.

Understand the Robots Exclusion Protocol, what robots.txt actually enforces (and doesn't), and how scrapers should treat it for both legal and operational reasons.

Explained

robots.txt is a convention dating back to 1994: a plain-text file at the root of a domain that lists which user-agents (crawlers) are allowed or disallowed from accessing which URL paths. Search engines (Google, Bing) and well-behaved crawlers honor it. The file is publicly readable — you can fetch any site's robots.txt directly (`curl https://example.com/robots.txt`) and read what the site owner wants crawlers to do.
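Because the file always sits at the root of the host, its location can be derived from any page URL. A minimal sketch using only Python's standard library; the function name `robots_url` and the example URLs are illustrative assumptions, not part of any library:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host (including any port); drop path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://example.com/products/42?ref=home"))
# https://example.com/robots.txt
```

Each scheme/host/port combination has its own robots.txt, which is why the sketch keeps `netloc` intact rather than stripping subdomains or ports.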

The Robots Exclusion Protocol is purely advisory. The web server doesn't enforce robots.txt directives; it just publishes them. A misbehaving crawler can ignore robots.txt and the site will serve the requested URLs anyway. Enforcement comes from search engine policy (Googlebot will not crawl paths disallowed in robots.txt, although a disallowed URL can still appear in the index if other pages link to it) and from legal / reputational consequences when commercial scrapers ignore explicit owner intent.

For commercial scraping, robots.txt matters in two ways. Legally, ignoring an explicit `Disallow` directive can be cited as evidence of bad-faith access in CFAA / computer-misuse disputes. In the US, hiQ v. LinkedIn is the usual reference point: the Ninth Circuit found that scraping publicly accessible pages likely falls outside the CFAA, but the scraper was still exposed to contract claims, so documented owner intent continues to matter. Operationally, sites that disallow specific paths often have stronger anti-bot protection on those paths: `Disallow: /search` usually means scraping `/search` will trigger heavier defenses than scraping the rest of the site.

How It Works

When a crawler starts on a domain, it fetches `/robots.txt` first and parses the directives. The file uses a simple line-based grammar: a `User-agent: ` line names the crawler the following rules apply to (`*` for all), `Disallow: ` excludes URL paths by prefix, `Allow: ` re-allows paths within a disallowed prefix, and the optional `Crawl-delay: ` and `Sitemap: ` directives provide metadata.
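The grammar above can be exercised with Python's standard-library parser. This is a sketch only: the robots.txt content, the `my-crawler` user-agent name and the URLs are made up for illustration, and the file is parsed from a string so no network request is made.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, written for illustration only.
SAMPLE = """\
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(SAMPLE.splitlines())  # parse from memory instead of fetching

admin_ok = parser.can_fetch("my-crawler", "https://example.com/admin/users")
blog_ok = parser.can_fetch("my-crawler", "https://example.com/blog/post-1")
sitemaps = parser.site_maps()  # available in Python 3.8+

print(admin_ok, blog_ok, sitemaps)
# False True ['https://example.com/sitemap.xml']
```

Against a live site, `set_url(...)` followed by `read()` would replace the `parse(...)` call.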

The crawler then checks each URL it considers fetching against the rules and skips disallowed paths. Some crawlers cache robots.txt for 24 hours; mature production crawlers refresh more frequently to catch changes.
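One way to combine the per-URL check with caching is a small per-host cache that re-fetches the file once it is older than a configurable age. The sketch below is an assumption about how such a cache could look, not a description of any particular crawler; `RobotsCache`, `fetch_text` and `clock` are hypothetical names, and the fetcher is injected so the example runs without network access.

```python
import time
from typing import Callable, Dict, Tuple
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsCache:
    """Cache one parsed robots.txt per host; re-fetch after max_age seconds."""

    def __init__(self, fetch_text: Callable[[str], str],
                 max_age: float = 24 * 3600,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch_text = fetch_text  # hypothetical: returns the robots.txt body
        self._max_age = max_age
        self._clock = clock
        self._entries: Dict[str, Tuple[float, RobotFileParser]] = {}

    def allowed(self, user_agent: str, url: str) -> bool:
        parts = urlsplit(url)
        host_key = f"{parts.scheme}://{parts.netloc}"
        now = self._clock()
        cached = self._entries.get(host_key)
        if cached is None or now - cached[0] >= self._max_age:
            # Missing or stale entry: fetch and parse the file again.
            parser = RobotFileParser()
            parser.parse(self._fetch_text(host_key + "/robots.txt").splitlines())
            cached = (now, parser)
            self._entries[host_key] = cached
        return cached[1].can_fetch(user_agent, url)


# Stand-in fetcher; a real one would issue an HTTP GET for the given URL.
cache = RobotsCache(lambda robots_url: "User-agent: *\nDisallow: /search\n")
print(cache.allowed("my-crawler", "https://example.com/search/results"))  # False
print(cache.allowed("my-crawler", "https://example.com/about"))           # True
```

Lowering `max_age` trades extra requests for faster pickup of rule changes.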

Types

Disallow

Tells crawlers not to access URLs matching the prefix. `Disallow: /admin/` excludes everything under /admin/. The most common directive.

Allow

Re-allows specific paths within a disallowed prefix. `Disallow: /private/` + `Allow: /private/public-data/` excludes /private/ but allows /private/public-data/, because the longer, more specific rule takes precedence.
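How `Allow` and `Disallow` interact can be sketched as a longest-prefix match, which is the precedence rule in RFC 9309 (when an Allow and a Disallow rule match equally, Allow wins). The function below is a simplified illustration under stated assumptions: `is_allowed` and the `Rule` tuple are hypothetical names, and the `*` and `$` wildcards are deliberately not handled. Note that Python's built-in `urllib.robotparser` instead applies the first matching rule in file order, so rule ordering can change its answer.

```python
from typing import List, Tuple

Rule = Tuple[str, str]  # (directive, path prefix), e.g. ("disallow", "/private/")


def is_allowed(path: str, rules: List[Rule]) -> bool:
    """Apply the longest matching prefix; on a tie, Allow wins."""
    best_len = -1
    allowed = True  # no matching rule means the path is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty rule path matches nothing
        is_allow = directive.lower() == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


rules = [("disallow", "/private/"), ("allow", "/private/public-data/")]
print(is_allowed("/private/reports/q1.csv", rules))      # False
print(is_allowed("/private/public-data/q1.csv", rules))  # True
print(is_allowed("/blog/", rules))                       # True
```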

User-agent

Targets specific crawlers. `User-agent: Googlebot` applies the following rules to Google's crawler only. `User-agent: *` applies to all crawlers as a fallback.
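A short sketch of group selection using Python's standard-library parser. The file content and the `my-crawler` name are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one crawler-specific group and one fallback group.
SAMPLE = """\
User-agent: Googlebot
Disallow: /no-google/

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(SAMPLE.splitlines())

results = {
    (agent, path): parser.can_fetch(agent, "https://example.com" + path)
    for agent in ("Googlebot", "my-crawler")
    for path in ("/no-google/page", "/private/page")
}

for (agent, path), verdict in results.items():
    print(agent, path, verdict)
```

Here Googlebot is blocked from `/no-google/page` but allowed on `/private/page`, while `my-crawler` gets the opposite result: a crawler that matches a named group follows only that group, and the `*` group applies only to crawlers with no group of their own.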

Crawl-delay

Suggests a minimum delay (in seconds) between requests. It is not part of the core standard: Bing and Yandex are listed as honoring it, while Google ignores it and instead adjusts its crawl rate automatically based on how the server responds.
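For a cooperative crawler, honoring the directive comes down to reading the value and spacing requests accordingly. A minimal sketch, assuming a made-up robots.txt, a hypothetical `seconds_to_wait` helper, and a fallback `DEFAULT_DELAY` that is a local choice rather than anything the protocol defines:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, written for illustration only.
SAMPLE = """\
User-agent: *
Crawl-delay: 5
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(SAMPLE.splitlines())


def seconds_to_wait(last_request_at: float, now: float, delay: float) -> float:
    """Time still to wait before the next request to the same host."""
    return max(0.0, delay - (now - last_request_at))


# crawl_delay() returns None when no delay is declared; fall back to a default.
DEFAULT_DELAY = 1.0  # assumed default, not part of any standard
delay = parser.crawl_delay("my-crawler") or DEFAULT_DELAY

print(delay)                                 # 5
print(seconds_to_wait(100.0, 102.0, delay))  # 3.0
print(seconds_to_wait(100.0, 110.0, delay))  # 0.0
```

A real crawler would pass timestamps from a monotonic clock and sleep for the returned number of seconds before issuing the next request.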

Sitemap

Points to the site's XML sitemap(s). `Sitemap: https://example.com/sitemap.xml`. Helps crawlers discover all the URLs the site wants indexed.

Common Use Cases

Telling search engines which pages not to crawl
Preventing crawlers from hammering expensive endpoints
Pointing crawlers to the canonical sitemap
Keeping admin / private paths out of search engine crawls
Specifying crawl-delay for cooperative crawlers
Documenting site-wide crawler policy

FAQ

Frequently asked questions

Common questions about robots.txt.

robots.txt is not binding by itself. It is advisory: the server doesn't enforce it, and a non-cooperating crawler can ignore it. However, courts (especially in the US under CFAA) have cited explicit Disallow directives as evidence of the site owner's intent, which weakens scrapers' arguments in computer-misuse cases. Treat it as the legal floor of compliant scraping.