Explained
robots.txt is a convention dating back to 1994: a plain-text file at the root of a domain that lists which user-agents (crawlers) may or may not access which URL paths. Search engines (Google, Bing) and well-behaved crawlers honor it. The file is publicly readable: you can fetch any site's robots.txt directly (`curl https://example.com/robots.txt`) and read what the site owner wants crawlers to do.
The Robots Exclusion Protocol is purely advisory. The web server doesn't enforce robots.txt directives; it only publishes them. A misbehaving crawler can ignore robots.txt and the site will serve the requested URLs anyway. Enforcement comes from search engine policy (Googlebot will not crawl paths disallowed in robots.txt) and from the legal and reputational consequences commercial scrapers face when they ignore explicit owner intent.
For commercial scraping, robots.txt matters in two ways. Legally, ignoring an explicit `Disallow` directive can be cited as evidence of bad-faith access in CFAA / computer-misuse disputes (in the US, hiQ v. LinkedIn is the case most often discussed in this context). Operationally, sites that disallow specific paths often have stronger anti-bot protection on those paths: `Disallow: /search` usually means scraping `/search` will trigger heavier defenses than scraping the rest of the site.
How It Works
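When a crawler starts on a domain, it fetches `/robots.txt` first and parses the directives. The file uses a simple grammar: `User-agent:` lines name the crawler a group of rules applies to, and `Allow:` / `Disallow:` lines list the URL path prefixes that group may or may not fetch.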
The crawler then checks each URL it considers fetching against the rules and skips disallowed paths. Some crawlers cache robots.txt for 24 hours; mature production crawlers refresh more frequently to catch changes.
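The parse-then-check flow described above can be sketched with Python's standard `urllib.robotparser` module. This is a minimal illustration, not a production crawler: the robots.txt body, the `MyCrawler` and `ExampleBot` agent names, the candidate URLs, and the helper functions `load_rules`, `is_stale`, and `filter_allowed` are all hypothetical, and the 24-hour TTL is simply the caching figure mentioned above.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body, inlined so the sketch runs without network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /private/

User-agent: ExampleBot
Disallow: /
"""

# Assumed refresh interval, taken from the 24-hour caching figure in the text.
CACHE_TTL_SECONDS = 24 * 60 * 60


def load_rules(body: str) -> RobotFileParser:
    """Parse a robots.txt body into a queryable rule set."""
    parser = RobotFileParser()
    parser.parse(body.splitlines())  # parse() also records the load time
    return parser


def is_stale(parser: RobotFileParser, now: float) -> bool:
    """Return True when the cached rules are older than the TTL."""
    return now - parser.mtime() > CACHE_TTL_SECONDS


def filter_allowed(parser: RobotFileParser, user_agent: str, urls: list[str]) -> list[str]:
    """Keep only the URLs that the rules permit for this user-agent."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


rules = load_rules(ROBOTS_TXT)
candidates = [
    "https://example.com/products/1",
    "https://example.com/search/results",
    "https://example.com/private/report",
]

# "MyCrawler" has no group of its own, so it falls back to the "*" group.
allowed = filter_allowed(rules, "MyCrawler", candidates)
print(allowed)

# "ExampleBot" has a dedicated group that disallows every path.
print(filter_allowed(rules, "ExampleBot", candidates))
```

Under these assumed rules, `MyCrawler` keeps only the `/products/1` URL, and `ExampleBot` keeps nothing. A real crawler would fetch the file over HTTP (for example with `RobotFileParser.set_url()` followed by `read()`) and reload it once `is_stale` returns true. Note that the standard-library parser applies rules in file order, which can differ from the longest-match behavior some search engines use when `Allow` and `Disallow` rules overlap.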