Explained
A headless browser is a regular browser engine (Chrome, Firefox, WebKit) that runs without a visible window and exposes a programmatic API for navigation, DOM interaction, and rendering. It executes JavaScript, applies CSS, handles cookies, and renders pages the same way a regular browser does. The main differences are that there is no UI window and no human input, although automation can leave detectable traces (for example, `navigator.webdriver` is set to `true` when the browser is under automation).
For scraping, a headless browser is typically needed whenever the data you want is rendered client-side by JavaScript. Modern web apps (React, Vue, Angular SPAs) often return an empty HTML shell that gets populated after the JS runs; a plain HTTP client like `requests` or `axios` would only see the empty shell. A headless browser executes the full page lifecycle and gives you the fully rendered DOM.
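To illustrate the empty-shell problem, here is a minimal sketch using only the Python standard library. The `SPA_SHELL` and `RENDERED` strings are made-up examples, not output from any real site: the first stands in for what a plain HTTP client would receive, the second for the DOM after the page's JavaScript has run.

```python
from html.parser import HTMLParser

# Hypothetical response body from a single-page app, as a plain HTTP client
# would receive it: a mount point and a script tag, but no data.
SPA_SHELL = """<!DOCTYPE html>
<html>
  <head><title>Shop</title></head>
  <body>
    <div id="root"></div>
    <script src="/static/bundle.js"></script>
  </body>
</html>"""

# Hypothetical DOM for the same page after client-side rendering.
RENDERED = """<html><body>
  <div id="root"><h1>Widget</h1><p>$9.99</p></div>
</body></html>"""


class BodyTextExtractor(HTMLParser):
    """Collects text inside <body>, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.in_body and self.skip_depth == 0 and text:
            self.chunks.append(text)


def visible_body_text(html: str) -> str:
    """Return the text a reader would see in the body, joined by spaces."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    print(repr(visible_body_text(SPA_SHELL)))
    print(repr(visible_body_text(RENDERED)))
```

The shell yields no visible text at all, which is what a scraper built on a plain HTTP client would have to work with; the rendered version contains the product name and price.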
The major drivers are Playwright (Microsoft, multi-browser), Puppeteer (Google, Chrome/Firefox), and Selenium (older, WebDriver-based, broadest language support). Each gives you methods to navigate, click, type, wait for elements, intercept network requests, and extract content. For automation work that needs to look human, headless browsers paired with stealth plugins and residential proxies are a common stack.
How It Works
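When you launch a headless browser via Playwright/Puppeteer/Selenium, the driver starts a real browser process (Chromium, Firefox, or WebKit) in headless mode, for Chromium via the `--headless` flag, and connects to it over an automation protocol: Chrome DevTools Protocol (CDP) for Puppeteer and for Playwright when driving Chromium, and the W3C WebDriver protocol for Selenium. Playwright drives Firefox and WebKit through its own protocol implementations in patched builds of those browsers. Your script calls methods such as `page.goto`, `page.click`, and `page.evaluate`; the driver translates them into protocol commands, and the browser executes them as if a human user were driving.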
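To make the protocol layer more concrete, here is a rough sketch of what Chrome DevTools Protocol commands look like on the wire. CDP messages are JSON objects with an `id`, a `method`, and `params`; `Page.navigate` and `Runtime.evaluate` are real CDP methods. The `CdpSession` class is a hypothetical helper written for this illustration, and it only builds the messages. Actually sending them over the browser's WebSocket endpoint is left out, and a real driver issues many more commands and waits on events for each high-level call.

```python
import itertools
import json


class CdpSession:
    """Hypothetical helper that serializes CDP commands (no transport)."""

    def __init__(self):
        # Each command needs a unique id so the reply can be matched to it.
        self._ids = itertools.count(1)

    def command(self, method: str, **params) -> str:
        message = {"id": next(self._ids), "method": method, "params": params}
        return json.dumps(message)


if __name__ == "__main__":
    session = CdpSession()
    # Roughly the core of what a page.goto(...) call sends.
    print(session.command("Page.navigate", url="https://example.com"))
    # Roughly the core of what a page.evaluate(...) call sends.
    print(session.command(
        "Runtime.evaluate",
        expression="document.title",
        returnByValue=True,
    ))
```

The browser answers each command with a JSON message carrying the same `id`, which is how a driver pairs responses with the calls your script made.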
The browser handles everything a real browser does: the TLS handshake (with its own fingerprint), HTTP/2 or HTTP/3 negotiation, cookie storage, JavaScript execution, layout, paint, and network requests for sub-resources. Your script can intercept and modify network requests and responses, inject scripts, and extract data from the rendered DOM.
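As one possible illustration of request interception and DOM extraction, the sketch below uses Playwright's synchronous Python API. It assumes Playwright is installed (`pip install playwright`, then `playwright install chromium`). The function names and the choice of which resource types to block are assumptions made for this example, not something the text above prescribes.

```python
# Resource types this sketch chooses to block; the set is an illustrative
# assumption to cut bandwidth, not a general recommendation.
BLOCKED_RESOURCE_TYPES = {"image", "font", "media"}


def should_block(resource_type: str) -> bool:
    """Decide whether a sub-resource request should be aborted."""
    return resource_type in BLOCKED_RESOURCE_TYPES


def fetch_rendered_html(url: str) -> str:
    """Load a page in headless Chromium and return the rendered DOM as HTML."""
    # Imported lazily so should_block() stays usable without Playwright.
    from playwright.sync_api import sync_playwright

    def handle_route(route):
        # Called for every request the page makes, including sub-resources.
        if should_block(route.request.resource_type):
            route.abort()
        else:
            route.continue_()

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.route("**/*", handle_route)
            page.goto(url)
            # Serialized DOM after scripts have run, not the raw response body.
            return page.content()
        finally:
            browser.close()
```

Calling `fetch_rendered_html("https://example.com")` would return the page's HTML after JavaScript execution, with image, font, and media requests aborted along the way. Keeping the blocking decision in a separate pure function makes it easy to adjust or test without launching a browser.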