Integrations

Use Shifter with Beautiful Soup

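Pair Shifter's residential and ISP proxies with Beautiful Soup for clean, expressive Python scraping. Beautiful Soup handles the HTML parsing and Shifter supplies the residential IPs, so no headless browser is needed.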

Quick start

Installation

pip install beautifulsoup4 requests lxml

Basic usage

import requests
from bs4 import BeautifulSoup

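# Give the proxy URL an explicit scheme so the HTTP client does not have to guess it.
# http:// is assumed here; confirm the scheme against your Shifter dashboard.
proxy_url = "http://customer-USERNAME-country-us-sid-123ABC:PASSWORD@p.shifter.io:443"
proxies   = {"http": proxy_url, "https": proxy_url}

response = requests.get("https://example.com", proxies=proxies, timeout=30)
response.raise_for_status()  # fail on 4xx/5xx instead of parsing an error page
soup = BeautifulSoup(response.text, "lxml")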

print(soup.title.string)
for article in soup.select("article.post"):
    print(article.h2.text.strip(), "->", article.a["href"])

Features

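Works with requests, httpx, aiohttp, and any other Python HTTP client that accepts a proxy URL
Rotates the IP on every request by default; add `sid` for a sticky session and `ttl-N` to pin the IP for N seconds
Compatible with bs4 4.x and Python 3.7+, with both the lxml and html.parser backends
Geo-targeting in 195+ countries and regions through username parameters: country, region, city, ASN
An order of magnitude faster than headless-browser scraping on static or light-JS targets
Drops into Scrapy, FastAPI crawlers, Airflow tasks, AWS Lambda, and any other Python data pipeline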
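
Since rotation, sticky sessions, and geo-targeting are all controlled through the proxy username, it can help to assemble that username in one place. The following is a minimal sketch, not an official Shifter helper: `build_proxy_url` is a hypothetical function, the `http` scheme is an assumption, and only the `country`, `city`, `sid`, and `ttl` tokens that appear in the examples on this page are used, in the order they appear there.

```python
from typing import Optional
from urllib.parse import quote


def build_proxy_url(
    username: str,
    password: str,
    country: Optional[str] = None,
    city: Optional[str] = None,
    sid: Optional[str] = None,
    ttl: Optional[int] = None,
    scheme: str = "http",            # assumption: confirm with your Shifter dashboard
    host: str = "p.shifter.io:443",
) -> str:
    """Assemble a proxy URL from targeting and session options (hypothetical helper)."""
    parts = ["customer", username]
    for key, value in (("country", country), ("city", city), ("sid", sid), ("ttl", ttl)):
        if value is not None:
            parts += [key, str(value)]
    user = "-".join(parts)
    # Percent-encode the credentials so characters such as "@" or ":" cannot break the URL.
    return f"{scheme}://{quote(user, safe='')}:{quote(password, safe='')}@{host}"


# No sid: the IP rotates on every request.
rotating = build_proxy_url("USERNAME", "PASSWORD", country="us")

# sid plus ttl: one London IP, pinned for 300 seconds.
sticky = build_proxy_url(
    "USERNAME", "p@ss:word", country="uk", city="london", sid="123ABC", ttl=300
)
```

Either string can be used as the value for both keys of a requests `proxies` dict.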

Examples

Sticky session + multi-page crawl

Adding `sid-XXX` to the proxy username pins one residential IP for the whole paginated crawl. Add `country-uk` and `city-london` for geo-targeting.

import requests
import secrets
from bs4 import BeautifulSoup
from urllib.parse import urljoin

sid = secrets.token_hex(4)

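# sid pins the session; ttl-300 holds the same IP for 300 seconds.
# http:// is an assumed scheme; confirm it against your Shifter dashboard.
proxy_url = (
    f"http://customer-USERNAME-country-uk-city-london-sid-{sid}-ttl-300:"
    f"PASSWORD@p.shifter.io:443"
)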

# Use a session so connection pooling and cookies persist across requests.
session = requests.Session()
session.proxies = {"http": proxy_url, "https": proxy_url}
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Accept-Language": "en-GB,en;q=0.9",
})

products  = []
url       = "https://example.co.uk/products"

while url:
    response = session.get(url, timeout=30)
    response.raise_for_status()  # stop on 4xx/5xx rather than silently ending the crawl
    soup     = BeautifulSoup(response.text, "lxml")

    for card in soup.select(".product-card"):
        products.append({
            "title": card.select_one("h2").text.strip(),
            "price": card.select_one(".price").text.strip(),
            "url":   urljoin(url, card.select_one("a")["href"]),
        })

    next_link = soup.select_one("a.next-page")
    url       = urljoin(url, next_link["href"]) if next_link else None

print(f"Scraped {len(products)} products")

Parallel scraping with concurrent.futures

Drop the sid to rotate the IP on every request. ThreadPoolExecutor + requests + Shifter scales to dozens of concurrent requests without tripping per-IP rate limits.

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor, as_completed

# No sid -> every request gets a different residential IP.
# http:// is an assumed scheme; confirm it against your Shifter dashboard.
PROXY_URL = "http://customer-USERNAME-country-us:PASSWORD@p.shifter.io:443"

def scrape(url: str) -> dict:
    response = requests.get(
        url,
        proxies={"http": PROXY_URL, "https": PROXY_URL},
        headers={"User-Agent": "Mozilla/5.0 AppleWebKit/537.36"},
        timeout=30,
    )
    soup = BeautifulSoup(response.text, "lxml")

    return {
        "url":   url,
        # Pages without a <title> would otherwise raise AttributeError here.
        "title": soup.title.get_text(strip=True) if soup.title else "",
        "h1":    [h.text.strip() for h in soup.select("h1")],
        "links": [a["href"] for a in soup.select("a[href]")[:20]],
    }

urls = [
    "https://example.com/category/laptops",
    "https://example.com/category/phones",
    "https://example.com/category/tablets",
    "https://example.com/category/wearables",
    # ... hundreds more
]

with ThreadPoolExecutor(max_workers=16) as pool:
    futures = {pool.submit(scrape, u): u for u in urls}
    for f in as_completed(futures):
        try:
            result = f.result()
            print(result["url"], "->", result["title"])
        except Exception as exc:
            print("error:", futures[f], exc)
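
The `links` field collected above holds raw `href` values, which may be relative, repeated, or not HTTP links at all. If those links feed the next crawl round, a small normalization step helps. The sketch below uses only the standard library; `normalize_links` is a hypothetical helper name, not part of Beautiful Soup or Shifter.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def normalize_links(page_url, hrefs):
    """Resolve hrefs against page_url, drop fragments and non-HTTP links, dedupe in order."""
    seen = set()
    result = []
    for href in hrefs:
        absolute, _fragment = urldefrag(urljoin(page_url, href.strip()))
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, tel: and similar
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


links = normalize_links(
    "https://example.com/category/laptops",
    ["/item/1", "item/2#reviews", "https://example.com/item/1", "mailto:sales@example.com"],
)
```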

Robust scraping with retries and backoff

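Production scraping needs retries on 5xx responses and connection errors. Combine urllib3's Retry with Shifter: Retry handles backoff within a call, and every call to the client uses a new sid, so a fresh request gets away from a temporarily banned IP.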

import requests
import secrets
from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

class ShifterClient:
    """requests.Session wrapper that uses a new sid (a new residential IP) on every call."""

    def __init__(self, country="us"):
        self.country = country
        self._session = requests.Session()

        retry = Retry(
            total=5,
            backoff_factor=1.5,
            status_forcelist=[429, 500, 502, 503, 504],
            allowed_methods=["GET", "POST", "HEAD"],
        )
        adapter = HTTPAdapter(max_retries=retry, pool_connections=20)
        self._session.mount("http://",  adapter)
        self._session.mount("https://", adapter)

    def _proxy(self) -> str:
        sid = secrets.token_hex(4)
        # http:// is an assumed scheme; confirm it against your Shifter dashboard.
        return (
            f"http://customer-USERNAME-country-{self.country}-sid-{sid}:"
            f"PASSWORD@p.shifter.io:443"
        )

    def get(self, url: str, **kwargs) -> requests.Response:
        # Build the proxy URL once so both schemes share the same sid for this call.
        proxy = self._proxy()
        return self._session.get(
            url,
            proxies={"http": proxy, "https": proxy},
            timeout=kwargs.pop("timeout", 30),
            **kwargs,
        )

client   = ShifterClient(country="de")
response = client.get("https://example.de/products")
soup     = BeautifulSoup(response.text, "lxml")

for product in soup.select(".product"):
    print(product.h2.text.strip(), product.select_one(".price").text.strip())
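
One caveat: urllib3's Retry re-sends a request through the same proxy URL it started with, so the sid stays fixed for all retries inside a single call. If each attempt should go out with a different sid, the retry loop has to live in your own code. The following is a minimal sketch of that idea, not Shifter's or requests' API: `fetch_with_rotation` and the injected `fetch` callable are hypothetical names, and the `http://` scheme is an assumption.

```python
import secrets
import time
from typing import Callable, Tuple

RETRY_STATUSES = frozenset({429, 500, 502, 503, 504})


def fetch_with_rotation(
    fetch: Callable[[str, str], Tuple[int, str]],
    url: str,
    country: str = "us",
    max_attempts: int = 5,
    backoff_factor: float = 1.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call fetch(url, proxy_url) with a new sid per attempt until the status is acceptable."""
    last_error = None
    for attempt in range(max_attempts):
        sid = secrets.token_hex(4)  # new sid -> new sticky session for this attempt
        proxy_url = (
            f"http://customer-USERNAME-country-{country}-sid-{sid}:"
            f"PASSWORD@p.shifter.io:443"
        )
        try:
            status, body = fetch(url, proxy_url)
            if status not in RETRY_STATUSES:
                return status, body
            last_error = RuntimeError(f"HTTP {status}")
        except OSError as exc:  # requests' exceptions derive from OSError
            last_error = exc
        if attempt < max_attempts - 1:
            sleep(backoff_factor * 2 ** attempt)  # exponential backoff between attempts
    raise RuntimeError(f"giving up on {url} after {max_attempts} attempts") from last_error


# Stand-in for a real HTTP call: fails twice with 503, then succeeds.
seen_proxies = []

def fake_fetch(url: str, proxy_url: str) -> Tuple[int, str]:
    seen_proxies.append(proxy_url)
    if len(seen_proxies) < 3:
        return 503, ""
    return 200, "<html><title>ok</title></html>"

status, body = fetch_with_rotation(
    fake_fetch, "https://example.de/products", country="de", sleep=lambda seconds: None
)
```

With requests, `fetch` would wrap `session.get(url, proxies={"http": proxy_url, "https": proxy_url}, timeout=30)` and return `(response.status_code, response.text)`.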

httpx (async) + Beautiful Soup

If you need async concurrency across thousands of pages, swap requests for httpx. You get the same Shifter URL, native async/await, and full compatibility with Beautiful Soup.

# pip install httpx beautifulsoup4 lxml
import asyncio
import httpx
from bs4 import BeautifulSoup

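# httpx requires an explicit scheme on the proxy URL.
# http:// is assumed here; confirm it against your Shifter dashboard.
PROXY = "http://customer-USERNAME-country-fr-sid-789GHI:PASSWORD@p.shifter.io:443"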

async def fetch(client: httpx.AsyncClient, url: str) -> dict:
    resp = await client.get(url, timeout=30)
    soup = BeautifulSoup(resp.text, "lxml")
    return {
        "url":      url,
        # Pages without a <title> would otherwise raise AttributeError here.
        "title":    soup.title.get_text(strip=True) if soup.title else "",
        "headings": [h.text.strip() for h in soup.select("h2")],
    }

async def main():
    async with httpx.AsyncClient(proxy=PROXY) as client:
        urls = [
            f"https://example.fr/products?page={i}" for i in range(1, 51)
        ]
        results = await asyncio.gather(*[fetch(client, u) for u in urls])

    for r in results:
        print(r["url"], "->", r["title"])

asyncio.run(main())
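
The example above hands all 50 URLs to `asyncio.gather` at once. For thousands of pages it is usually worth capping how many requests are in flight at a time. The sketch below shows one common way to do that with `asyncio.Semaphore`; `gather_bounded` is a hypothetical helper, and `fake_fetch` stands in for the `fetch` coroutine so the snippet runs without network access.

```python
import asyncio
from typing import Awaitable, Callable, List


async def gather_bounded(
    fetch_one: Callable[[str], Awaitable[str]],
    urls: List[str],
    limit: int = 10,
) -> List[str]:
    """Run fetch_one over urls with at most `limit` calls in flight; results keep input order."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(url: str) -> str:
        async with semaphore:
            return await fetch_one(url)

    return await asyncio.gather(*(guarded(u) for u in urls))


# Stand-in for a real request; it records how many calls overlap.
stats = {"in_flight": 0, "peak": 0}

async def fake_fetch(url: str) -> str:
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)
    stats["in_flight"] -= 1
    return url.rsplit("=", 1)[-1]  # the page number, as a string

urls = [f"https://example.fr/products?page={i}" for i in range(1, 51)]
pages = asyncio.run(gather_bounded(fake_fetch, urls, limit=5))
```

In the httpx version, `fetch_one` would be a small wrapper that closes over the shared `AsyncClient` and calls `fetch(client, url)`.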

FAQ

Common questions about using Shifter with Beautiful Soup.

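No, Beautiful Soup itself needs no proxy configuration. It is a parser and does not issue HTTP requests. Configure the proxy on the HTTP client you use alongside bs4 (requests, httpx, aiohttp, urllib). Once the HTML has been fetched through Shifter, pass it to BeautifulSoup() as usual.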
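
To make the split concrete, here is a small sketch in which Beautiful Soup parses an HTML string with no network access at all. The markup is made up for illustration; in a real scraper the string would be the `response.text` fetched through the proxy.

```python
from bs4 import BeautifulSoup

# Illustrative markup; in practice this is the body returned by your HTTP client.
html = """
<html><head><title>Catalog</title></head>
<body>
  <article class="post"><h2>First post</h2><a href="/posts/1">Read</a></article>
  <article class="post"><h2>Second post</h2><a href="/posts/2">Read</a></article>
</body></html>
"""

# html.parser ships with Python, so this runs without lxml installed.
soup = BeautifulSoup(html, "html.parser")

title = soup.title.get_text(strip=True)
posts = [
    (article.h2.get_text(strip=True), article.a["href"])
    for article in soup.select("article.post")
]
```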

Get started

Start using Shifter with Beautiful Soup

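Combine Shifter's 205M+ residential and ISP proxies with Beautiful Soup for clean, expressive Python scraping. Per-request rotation, sticky sessions, and full async support through httpx are all available.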

Try Shifter for free. Set up in minutes, cancel anytime.