A decade ago, “the web” and “the open web” were almost the same thing. You typed a URL, you got HTML, you read it. Search engines indexed it. Researchers cited it. Aggregators surfaced it. There was a public address space and a public protocol and a default assumption that things on it were reachable.
That assumption has been quietly eroding for years. It’s worth saying out loud where things ended up, because the implications for the next generation of data products are large.
What “open” used to mean
The early-2010s open web had three properties that we mostly took for granted:
Reachability. A page had a URL. Fetching the URL returned the page. The page had links to other pages, which also had URLs. The graph was traversable.
Renderability. A request for the URL returned HTML or close enough. You could read it without running JavaScript, or you could run JavaScript and get the same content with cosmetics.
Stability of identity. A page that existed last week probably existed today, at the same URL, with similar content. You could cite it. You could check it.
None of those properties are guaranteed today.
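One way to make the three properties concrete is to check them against a recorded fetch. The sketch below is illustrative only: the `Fetch` record, the `check_open_web` function, and the 200-character threshold for "readable without JavaScript" are assumptions made for this example, not an established test.

```python
import hashlib
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Fetch:
    """A recorded response for one URL (hypothetical record type)."""
    status: int
    html: str


def visible_text(html: str) -> str:
    """Crudely strip scripts, styles, and tags to approximate what a reader sees without JavaScript."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    return " ".join(text.split())


def fingerprint(html: str) -> str:
    """Hash the visible text so that markup-only changes do not count as content changes."""
    return hashlib.sha256(visible_text(html).encode("utf-8")).hexdigest()


def check_open_web(current: Fetch, previous: Optional[Fetch] = None, min_chars: int = 200) -> dict:
    # Reachability: fetching the URL returned the page.
    reachable = current.status == 200
    # Renderability: enough readable text is present without running JavaScript.
    # The min_chars threshold is an arbitrary assumption.
    renderable = reachable and len(visible_text(current.html)) >= min_chars
    # Stability: same visible text as the earlier fetch. An exact match is stricter
    # than "similar content"; a real check would use a similarity measure.
    stable = None
    if previous is not None:
        stable = reachable and fingerprint(current.html) == fingerprint(previous.html)
    return {"reachable": reachable, "renderable": renderable, "stable": stable}
```

A server-rendered article fetched twice with the same text passes all three checks. A JavaScript-only shell that returns an empty container fails renderability even though it is reachable.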
What’s eaten the open web
A few forces, mostly in parallel, mostly with reasonable individual motivations:
Paywalls. News organizations needed revenue. Many major outlets now put a hard or metered paywall in front of most of their content. The content still exists at the URL, but you can’t read it without an account and a credit card. Search engine crawlers are often served a different version than logged-out humans.
Login walls. Social platforms used to be reachable from the open web. You could read a Twitter thread without an account in 2014. Today most of LinkedIn, X, Instagram, Reddit (intermittently), and Facebook are gated. The content is still there, but the URL now returns a login prompt instead of the content.
App-first. A growing share of content lives in apps that have little or no web equivalent: TikTok and a long tail of vertical-specific apps. For much of that content there is no URL. There is no page. The content exists only inside a closed runtime.
Anti-scraping. Even when the URL still works, the content behind it is increasingly defended by Cloudflare, Akamai, and a number of specialist anti-bot vendors. The page renders fine for a visitor on a household connection and often returns a CAPTCHA challenge for everyone else.
Personalization. Pages that look the same actually aren’t. The product listing you see on a major e-commerce site has been personalized to your location, your browsing history, your inferred income bracket, your A/B test bucket. There’s no longer a canonical version. There’s a per-visitor version.
Each of these has reasonable defenders. None is going to reverse. The net effect is that the open web is smaller and more fragmented than it was, and the cost of treating the public web as a representative dataset has gone up.
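The personalization point in particular can be checked mechanically: fetch the same URL under several visitor profiles and see whether the responses agree. This is a minimal sketch, assuming the responses have already been collected into a mapping from a profile label to the returned body. The function name `canonical_report` and the profile labels are hypothetical, and comparing whitespace-normalized hashes is only one possible definition of "the same page."

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def canonical_report(bodies_by_profile: Dict[str, str]) -> Dict[str, object]:
    """Group visitor profiles by the content they were served for one URL."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for profile, body in bodies_by_profile.items():
        # Collapse whitespace so trivial formatting differences are ignored.
        normalized = " ".join(body.split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        groups[digest].append(profile)
    # Largest group first; ties broken alphabetically for a deterministic result.
    variants = sorted((sorted(p) for p in groups.values()), key=lambda g: (-len(g), g))
    return {
        "variant_count": len(variants),
        # A canonical version exists only if every profile saw the same content.
        "has_canonical": len(variants) == 1,
        "variants": variants,
    }
```

If two profiles see one price and a third sees another, `variant_count` is 2 and `has_canonical` is false. The same comparison between a crawler profile and a logged-out browser profile would surface the paywall and anti-scraping cases as well.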
Why AI raises the stakes
In the previous era, the open web was mostly consumed by humans reading individual pages and by search engines indexing them. The economic value of any given page was relatively low.
That has changed. Modern AI systems treat the web as their dataset in three distinct ways:
Training. Frontier models are trained on web-scale corpora. The composition of that corpus determines what the model knows and how it answers questions.
Grounding (retrieval-augmented generation). Production systems that need current information (the date, the price, the news, the regulatory update) retrieve from the live web at inference time. The retrieval target is real URLs returning real HTML.
Agentic browsing. A new class of AI systems navigates the web on behalf of users, booking flights, comparing prices, filling out forms. These systems hit real pages and need them to render correctly.
In all three cases, what the open web can or cannot serve directly determines what an AI system can do. If a page is paywalled, behind a login, or rendered only inside an app, it functionally doesn’t exist for the AI layer. If it’s served differently to bots than to humans, the AI sees a degraded version. If it’s blocked at the network layer, the AI sees nothing.
The cost of an incomplete open web compounds at the AI layer because AI systems reuse whatever they fetch at scale. One bad scrape used to be a missing data point. Today, the same gap can show up in a million model responses.
What infrastructure has to do with it
Proxy networks are the boring, plumbing-layer answer to one piece of this problem. They don’t fix paywalls, they don’t unlock apps, and they don’t push back on anti-bot vendors. What they do is restore a measure of representative access: they let a request originate from a real residential connection, in a real geography, with a real network reputation, so the open web that’s still open actually opens for it.
This is a narrow fix to a much larger problem. We don’t pretend otherwise. The full answer to “how do we keep the open web functional” is going to require regulatory pressure, public infrastructure investments, transparency requirements on personalization and gating, and a general re-litigation of what “publicly available” should mean in the era of AI consumption. None of that is going to be settled this year.
In the meantime, the parts of the web that are still public should actually be reachable for the systems that need to reach them. That’s the small piece we work on, and it is what mostly motivated the latest round of investment in our residential network.
The honest version
A more accessible open web is good for AI grounding, good for research, good for journalism, good for price transparency, good for consumer choice, good for almost everyone except the entities directly capturing rents from gating. The trend line is against it. Nothing we ship is going to reverse the trend.
What we can do, and what the broader infrastructure layer can do, is make sure that the parts of the web that are still public stay reachable for the systems that need them. That’s a smaller goal than “save the open web.” It’s also achievable, and it matters more every year that AI becomes the primary consumer of what the web has to offer.