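Integrations
Use Shifter with Beautiful Soup
Pair Shifter's residential and ISP proxies with Beautiful Soup for clean, expressive Python scraping. Beautiful Soup handles the HTML parsing and Shifter supplies the residential IPs, with no headless browser required.
Quick start
Installation
pip install beautifulsoup4 requests lxml
Basic usage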
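import requests
from bs4 import BeautifulSoup

# USERNAME and PASSWORD are placeholders for your Shifter credentials.
# The proxy URL needs an explicit scheme; "http://" is an assumption here,
# so check your Shifter dashboard for the exact endpoint format.
proxy_url = "http://customer-USERNAME-country-us-sid-123ABC:PASSWORD@p.shifter.io:443"
proxies = {"http": proxy_url, "https": proxy_url}

response = requests.get("https://example.com", proxies=proxies, timeout=30)
soup = BeautifulSoup(response.text, "lxml")
print(soup.title.string)

# Print the headline and link of every post on the page.
for article in soup.select("article.post"):
    print(article.h2.text.strip(), "->", article.a["href"])

Features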
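Works with requests, httpx, aiohttp, and any other Python HTTP client that supports proxy configuration
Rotates per request by default; use `sid` for sticky sessions and `ttl-N` to pin an IP for N seconds
Compatible with bs4 4.x and Python 3.7+, with both the lxml and html.parser backends
Geo-targeting across 195+ countries and regions through username parameters: country, region, city, ASN
For static or light-JS targets, roughly an order of magnitude faster than scraping with a headless browser
Drops into Scrapy, FastAPI crawlers, Airflow jobs, AWS Lambda, and any Python data pipeline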
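The username parameters listed above compose into a single proxy URL. As a rough sketch, a small builder can keep that string formatting in one place. The helper name `shifter_proxy_url` is hypothetical, the `http://` scheme is an assumption, and only the `country`, `city`, `sid`, and `ttl` tokens that appear in this page's examples are covered.

```python
from typing import Optional
from urllib.parse import quote


def shifter_proxy_url(
    username: str,
    password: str,
    *,
    country: Optional[str] = None,
    city: Optional[str] = None,
    sid: Optional[str] = None,
    ttl: Optional[int] = None,
    host: str = "p.shifter.io",
    port: int = 443,
    scheme: str = "http",  # assumption: the page's examples omit the scheme
) -> str:
    """Build a proxy URL of the form customer-USER[-country-..][-city-..][-sid-..][-ttl-..]."""
    parts = ["customer", username]
    if country:
        parts += ["country", country]
    if city:
        parts += ["city", city]
    if sid:
        parts += ["sid", sid]
    if ttl is not None:
        parts += ["ttl", str(ttl)]
    # Percent-encode credentials so characters such as "@" or ":" cannot break the URL.
    user = quote("-".join(parts), safe="")
    secret = quote(password, safe="")
    return f"{scheme}://{user}:{secret}@{host}:{port}"
```

Leaving out `sid` gives the default per-request rotation; passing `sid` (optionally with `ttl`) gives a sticky session.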
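Examples
Sticky session + multi-page crawl
Pin one residential IP for an entire paginated crawl by adding `sid-XXX` to the proxy username. Add `country-uk` and `city-london` for geo-targeting.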
import requests
import secrets
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# One random sid pins a single residential IP; ttl-300 holds it for 300 seconds.
sid = secrets.token_hex(4)
# The "http://" scheme is an assumption; USERNAME and PASSWORD are placeholders.
proxy_url = (
    f"http://customer-USERNAME-country-uk-city-london-sid-{sid}-ttl-300:"
    f"PASSWORD@p.shifter.io:443"
)

# Use a session so connection pooling and cookies persist across requests.
session = requests.Session()
session.proxies = {"http": proxy_url, "https": proxy_url}
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Accept-Language": "en-GB,en;q=0.9",
})

products = []
url = "https://example.co.uk/products"
while url:
    response = session.get(url, timeout=30)
    soup = BeautifulSoup(response.text, "lxml")
    for card in soup.select(".product-card"):
        products.append({
            "title": card.select_one("h2").text.strip(),
            "price": card.select_one(".price").text.strip(),
            "url": urljoin(url, card.select_one("a")["href"]),
        })
    # Follow the "next page" link until the last page, which has none.
    next_link = soup.select_one("a.next-page")
    url = urljoin(url, next_link["href"]) if next_link else None

print(f"Scraped {len(products)} products")

Parallel scraping with concurrent.futures
Drop the sid to rotate per request. ThreadPoolExecutor + requests + Shifter scales to dozens of concurrent requests without tripping per-IP rate limits.
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor, as_completed

# No sid -> every request gets a different residential IP.
# The "http://" scheme is an assumption; USERNAME and PASSWORD are placeholders.
PROXY_URL = "http://customer-USERNAME-country-us:PASSWORD@p.shifter.io:443"


def scrape(url: str) -> dict:
    """Fetch one page through the proxy and extract its title, h1s, and first 20 links."""
    response = requests.get(
        url,
        proxies={"http": PROXY_URL, "https": PROXY_URL},
        headers={"User-Agent": "Mozilla/5.0 AppleWebKit/537.36"},
        timeout=30,
    )
    soup = BeautifulSoup(response.text, "lxml")
    # soup.title is None when the page has no <title>, so guard before reading it.
    title = soup.title.string if soup.title and soup.title.string else ""
    return {
        "url": url,
        "title": title.strip(),
        "h1": [h.text.strip() for h in soup.select("h1")],
        "links": [a["href"] for a in soup.select("a[href]")[:20]],
    }


urls = [
    "https://example.com/category/laptops",
    "https://example.com/category/phones",
    "https://example.com/category/tablets",
    "https://example.com/category/wearables",
    # ... hundreds more
]

with ThreadPoolExecutor(max_workers=16) as pool:
    # Map each future back to its URL so failures can be reported by URL.
    futures = {pool.submit(scrape, u): u for u in urls}
    for f in as_completed(futures):
        try:
            result = f.result()
            print(result["url"], "->", result["title"])
        except Exception as exc:
            print("error:", futures[f], exc)

Robust crawling with retries and backoff
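Production scraping needs retries on 5xx responses and connection errors. Combine urllib3 Retry with Shifter: every get() call picks a fresh sid, so a temporary block on one IP does not carry over to the next call. Retries that urllib3 performs inside a single call go through the same proxy URL.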
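One detail about adapter-level retries is worth spelling out: urllib3's Retry re-sends a failed request through the same proxy URL, so the sid (and with it the IP) stays the same across those retries. If each attempt should leave from a different IP, the retry loop has to sit above the HTTP client. The following is a minimal sketch of that idea, not an official Shifter client. `make_proxy_url`, `fetch_with_rotation`, and the `send(url, proxy_url)` callable are hypothetical names, and the `http://` scheme on the proxy URL is an assumption.

```python
import secrets
import time
from typing import Callable, Tuple

RETRY_STATUSES = frozenset({429, 500, 502, 503, 504})


def make_proxy_url(country: str = "us") -> str:
    """Build a proxy URL with a fresh random sid, which requests a new sticky session."""
    sid = secrets.token_hex(4)
    # Assumptions: the http:// scheme; USERNAME and PASSWORD are placeholders.
    return f"http://customer-USERNAME-country-{country}-sid-{sid}:PASSWORD@p.shifter.io:443"


def fetch_with_rotation(
    url: str,
    send: Callable[[str, str], Tuple[int, str]],
    attempts: int = 5,
    backoff_factor: float = 1.5,
    country: str = "us",
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send(url, proxy_url) with a new sid per attempt until the status is not retryable."""
    last_error: Exception = RuntimeError("no attempts made")
    for attempt in range(attempts):
        proxy_url = make_proxy_url(country)
        try:
            status, body = send(url, proxy_url)
        except OSError as exc:  # requests' exceptions derive from OSError
            last_error = exc
        else:
            if status not in RETRY_STATUSES:
                return status, body
            last_error = RuntimeError(f"retryable status {status}")
        if attempt < attempts - 1:
            # Exponential backoff: 1.5s, 3s, 6s, ... with the default factor.
            sleep(backoff_factor * 2 ** attempt)
    raise RuntimeError(f"gave up on {url} after {attempts} attempts") from last_error


# Example wiring with requests (left as a comment so the sketch has no dependencies):
# def send(url, proxy_url):
#     r = requests.get(url, proxies={"http": proxy_url, "https": proxy_url}, timeout=30)
#     return r.status_code, r.text
```

The ShifterClient example that follows keeps the simpler adapter-level Retry, which is enough when a new IP per call, not per attempt, is acceptable.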
import requests
import secrets
from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


class ShifterClient:
    """requests.Session wrapper that picks a fresh sid (a new residential IP) for every get() call."""

    def __init__(self, country="us"):
        self.country = country
        self._session = requests.Session()
        # Retry connection errors and 429/5xx responses, with exponential backoff between attempts.
        retry = Retry(
            total=5,
            backoff_factor=1.5,
            status_forcelist=[429, 500, 502, 503, 504],
            allowed_methods=["GET", "POST", "HEAD"],
        )
        adapter = HTTPAdapter(max_retries=retry, pool_connections=20)
        self._session.mount("http://", adapter)
        self._session.mount("https://", adapter)

    def _proxy(self) -> str:
        # A new random sid requests a new sticky session.
        # The "http://" scheme is an assumption; USERNAME and PASSWORD are placeholders.
        sid = secrets.token_hex(4)
        return (
            f"http://customer-USERNAME-country-{self.country}-sid-{sid}:"
            f"PASSWORD@p.shifter.io:443"
        )

    def get(self, url: str, **kwargs) -> requests.Response:
        # Build the proxy URL once so http and https traffic share the same sid.
        proxy = self._proxy()
        timeout = kwargs.pop("timeout", 30)
        return self._session.get(
            url,
            proxies={"http": proxy, "https": proxy},
            timeout=timeout,
            **kwargs,
        )


client = ShifterClient(country="de")
response = client.get("https://example.de/products")
soup = BeautifulSoup(response.text, "lxml")
for product in soup.select(".product"):
    print(product.h2.text.strip(), product.select_one(".price").text.strip())

httpx (async) + Beautiful Soup
For async concurrency across thousands of pages, swap requests for httpx. Same Shifter URL, native async/await, fully compatible with Beautiful Soup.
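# pip install httpx beautifulsoup4 lxml
import asyncio
import httpx
from bs4 import BeautifulSoup

# httpx requires a scheme on the proxy URL; "http://" is an assumption here.
# USERNAME and PASSWORD are placeholders. The fixed sid keeps one sticky IP.
PROXY = "http://customer-USERNAME-country-fr-sid-789GHI:PASSWORD@p.shifter.io:443"


async def fetch(client: httpx.AsyncClient, url: str) -> dict:
    """Fetch one page and extract its title and h2 headings."""
    resp = await client.get(url, timeout=30)
    soup = BeautifulSoup(resp.text, "lxml")
    # Guard against pages without a <title> element.
    title = soup.title.string if soup.title and soup.title.string else ""
    return {
        "url": url,
        "title": title.strip(),
        "headings": [h.text.strip() for h in soup.select("h2")],
    }


async def main():
    # The proxy= argument needs httpx 0.26 or newer; older releases used proxies=.
    async with httpx.AsyncClient(proxy=PROXY) as client:
        urls = [
            f"https://example.fr/products?page={i}" for i in range(1, 51)
        ]
        results = await asyncio.gather(*[fetch(client, u) for u in urls])
        for r in results:
            print(r["url"], "->", r["title"])


asyncio.run(main())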
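When the URL list grows from fifty pages to thousands, a bare asyncio.gather starts every request at once. A common way to cap the number of requests in flight is an asyncio.Semaphore. The sketch below assumes a hypothetical helper name, `gather_bounded`, and takes any coroutine function as `fetch`, so it is independent of the HTTP client.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List, TypeVar

T = TypeVar("T")


async def gather_bounded(
    urls: Iterable[str],
    fetch: Callable[[str], Awaitable[T]],
    limit: int = 20,
) -> List[T]:
    """Run fetch(url) for every URL with at most `limit` calls in flight at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(url: str) -> T:
        # Each task waits here until one of the `limit` slots is free.
        async with semaphore:
            return await fetch(url)

    # gather preserves input order, so results line up with urls.
    return await asyncio.gather(*(guarded(u) for u in urls))
```

With the httpx example above, `fetch` could be `functools.partial(fetch, client)`, called from inside the `async with httpx.AsyncClient(...)` block.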
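FAQ
Frequently asked questions about using Shifter with Beautiful Soup.
No, Beautiful Soup itself needs no proxy configuration. It is a parser and makes no HTTP requests of its own. Configure the proxy on the HTTP client you use alongside bs4 (requests, httpx, aiohttp, urllib). Once you have fetched the HTML through Shifter, pass it to BeautifulSoup() as usual.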
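To illustrate that the proxy lives on the HTTP client and not on the parser, here is a minimal sketch using the standard library's urllib. The helper names `build_opener`, `page_title`, and `fetch_title` are hypothetical, the `http://` proxy scheme is an assumption, and `html.parser` is used so that lxml is not required.

```python
import urllib.request
from typing import Optional

from bs4 import BeautifulSoup

# Placeholder credentials; the http:// scheme is an assumption.
PROXY_URL = "http://customer-USERNAME-country-us:PASSWORD@p.shifter.io:443"


def build_opener(proxy_url: str) -> urllib.request.OpenerDirector:
    """Return a urllib opener that routes HTTP and HTTPS traffic through the proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def page_title(html: str) -> Optional[str]:
    """Parse HTML with Beautiful Soup; no proxy settings are involved at this stage."""
    soup = BeautifulSoup(html, "html.parser")
    if soup.title is None or soup.title.string is None:
        return None
    return soup.title.string.strip()


def fetch_title(url: str, proxy_url: str = PROXY_URL) -> Optional[str]:
    """Fetch a page through the proxy, then hand the HTML to the parser."""
    opener = build_opener(proxy_url)
    with opener.open(url, timeout=30) as response:
        html = response.read().decode("utf-8", errors="replace")
    return page_title(html)
```

Only `build_opener` knows about the proxy; `page_title` would work unchanged on HTML fetched by requests, httpx, or aiohttp.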
Get started
Start using Shifter with Beautiful Soup
Pair Shifter's 205M+ residential and ISP proxies with Beautiful Soup for clean, expressive Python scraping, with per-request rotation, sticky sessions, and full async support through httpx.