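Using Shifter with Apify
Plug Shifter's residential and ISP proxies into any Apify Actor: Crawlee handles queue management and retries, while Shifter supplies the residential IPs. ProxyConfiguration natively accepts Shifter URLs.
Quick start
Install
npm install apify crawlee
Basic usage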
// main.js (an Apify Actor)
import { Actor } from "apify";
import { CheerioCrawler, ProxyConfiguration } from "crawlee";
await Actor.init();
const proxyConfiguration = new ProxyConfiguration({
  proxyUrls: [
    // Crawlee needs a URL scheme; "http://" is assumed here, confirm it in your Shifter dashboard.
    "http://customer-USERNAME-country-us-sid-123ABC:PASSWORD@p.shifter.io:443",
  ],
});
const crawler = new CheerioCrawler({
  proxyConfiguration,
  // Log each page URL together with the text of its <h1>.
  async requestHandler({ request, $, log }) {
    log.info(`${request.url} -> ${$("h1").text().trim()}`);
  },
});
await crawler.run(["https://example.com"]);
await Actor.exit();
Features
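Examples
Crawlee + per-session rotation
Crawlee handles bans and session expiry automatically. Use newUrlFunction to generate a new Shifter URL for each session: when Crawlee retires a session because of a ban, the next session gets a fresh residential IP.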
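The newUrlFunction in this section draws a random sid on every call. Crawlee also passes the current session ID as the first argument to newUrlFunction, so one alternative is to derive the sid from that ID, which keeps a given Crawlee session on one Shifter sticky session however often the function runs. The sketch below is illustrative only: shifterUrlForSession is a hypothetical helper, the "http://" scheme is assumed, and it assumes Shifter accepts any alphanumeric sid.

```javascript
// Hypothetical helper: maps a Crawlee session ID to a Shifter proxy URL.
// The username layout mirrors the URLs used in this guide; the "http://"
// scheme and the allowed sid characters are assumptions.
function shifterUrlForSession(sessionId, { username, password, country, ttl }) {
  // Keep only letters and digits so the sid cannot break the dash-separated username.
  const sid = String(sessionId ?? "").replace(/[^a-zA-Z0-9]/g, "");
  if (sid.length === 0) {
    throw new Error("sessionId must contain at least one letter or digit");
  }
  const user = `customer-${username}-country-${country}-sid-${sid}-ttl-${ttl}`;
  // Percent-encode the password in case it contains characters such as "@" or ":".
  return `http://${user}:${encodeURIComponent(password)}@p.shifter.io:443`;
}

const demoOptions = { username: "USERNAME", password: "PASSWORD", country: "uk", ttl: 300 };
console.log(shifterUrlForSession("session_ab12", demoOptions));
// http://customer-USERNAME-country-uk-sid-sessionab12-ttl-300:PASSWORD@p.shifter.io:443
```

With Crawlee this could be wired in as newUrlFunction: (sessionId) => shifterUrlForSession(sessionId, demoOptions). The helper throws when no session ID is supplied, so it only fits crawlers that run with useSessionPool enabled.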
import { Actor } from "apify";
import { CheerioCrawler, ProxyConfiguration } from "crawlee";
await Actor.init();
const proxyConfiguration = new ProxyConfiguration({
  // Each session asks for a fresh URL, and Crawlee bumps the session
  // on bans, so stale IPs get cycled out automatically.
  // The "http://" scheme is an assumption; confirm it in your Shifter dashboard.
  newUrlFunction: () => {
    const sid = Math.random().toString(36).slice(2, 10);
    return `http://customer-USERNAME-country-uk-sid-${sid}-ttl-300:PASSWORD@p.shifter.io:443`;
  },
});
const crawler = new CheerioCrawler({
  proxyConfiguration,
  useSessionPool: true,
  persistCookiesPerSession: true,
  maxConcurrency: 8,
  async requestHandler({ request, $, enqueueLinks, log, session }) {
    log.info(`Session ${session.id} -> ${request.url}`);
    // Collect every product card first so the dataset write can be awaited once.
    const items = $(".product-card")
      .map((_, el) => ({
        url: request.url,
        title: $(el).find("h2").text().trim(),
        price: $(el).find(".price").text().trim(),
      }))
      .get();
    // Push to dataset (auto-persisted by Apify)
    await Actor.pushData(items);
    await enqueueLinks({ selector: "a.next-page", strategy: "same-domain" });
  },
  failedRequestHandler({ request, log }) {
    log.error(`Failed after retries: ${request.url}`);
  },
});
await crawler.run(["https://example.co.uk/products"]);
await Actor.exit();
PuppeteerCrawler (JS-heavy targets)
When the target needs a real browser, swap CheerioCrawler for PuppeteerCrawler. The same ProxyConfiguration plugs in unchanged: Crawlee passes the Shifter URL to Puppeteer through its launch arguments.
import { Actor } from "apify";
import { PuppeteerCrawler, ProxyConfiguration } from "crawlee";
await Actor.init();
const proxyConfiguration = new ProxyConfiguration({
  // The "http://" scheme is an assumption; confirm it in your Shifter dashboard.
  newUrlFunction: () => {
    const sid = Math.random().toString(36).slice(2, 10);
    return `http://customer-USERNAME-country-de-city-berlin-sid-${sid}:PASSWORD@p.shifter.io:443`;
  },
});
const crawler = new PuppeteerCrawler({
  proxyConfiguration,
  useSessionPool: true,
  launchContext: {
    launchOptions: { headless: "new" },
  },
  maxConcurrency: 4,
  async requestHandler({ request, page, log }) {
    log.info(`Visiting ${request.url}`);
    // Wait for client-side rendering to produce at least one product.
    await page.waitForSelector(".product");
    const products = await page.$$eval(".product", (els) =>
      els.map((el) => ({
        title: el.querySelector("h2")?.textContent?.trim(),
        price: el.querySelector(".price")?.textContent?.trim(),
      })),
    );
    await Actor.pushData(products);
  },
});
await crawler.run(["https://example.de/categories/electronics"]);
await Actor.exit();
Per-country Actor with an input schema
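Expose country as an Apify Actor input. The Actor reads the value at startup and configures Shifter for the matching residential IP pool, so the same Actor code works for every region.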
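Because the country value ends up inside the proxy username, it can help to apply the schema's defaults and limits in code as well, for example when the Actor runs without any input. The following is a minimal sketch under that assumption: normalizeInput is a hypothetical helper, not part of the Apify SDK, and the country list is copied from the enum in the input schema in this section.

```javascript
// Country codes copied from the enum in the input schema in this section.
const SUPPORTED_COUNTRIES = ["us", "uk", "de", "jp", "fr", "br"];

// Hypothetical helper: fills in defaults and rejects unknown countries
// before any proxy URL is built. `rawInput` may be null or undefined.
function normalizeInput(rawInput) {
  const input = rawInput ?? {};
  const country = String(input.country ?? "us").toLowerCase();
  if (!SUPPORTED_COUNTRIES.includes(country)) {
    throw new Error(`Unsupported country: ${country}`);
  }
  const maxPages = Number.isInteger(input.maxPages) ? input.maxPages : 100;
  return {
    startUrl: input.startUrl ?? "https://example.com",
    country,
    // Clamp to the 1..5000 range declared in the schema.
    maxPages: Math.min(5000, Math.max(1, maxPages)),
  };
}

console.log(normalizeInput({ country: "DE", maxPages: 9000 }));
// { startUrl: 'https://example.com', country: 'de', maxPages: 5000 }
```

In main.js this could wrap the input read, as in normalizeInput(await Actor.getInput()), so the rest of the Actor only ever sees validated values.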
// .actor/input_schema.json
{
  "title": "Localized Scraper Input",
  "type": "object",
  "schemaVersion": 1,
  "properties": {
    "startUrl": { "type": "string", "title": "Start URL", "default": "https://example.com" },
    "country": { "type": "string", "title": "Country", "enum": ["us","uk","de","jp","fr","br"], "default": "us" },
    "maxPages": { "type": "integer", "title": "Max Pages", "default": 100, "minimum": 1, "maximum": 5000 }
  },
  "required": ["startUrl", "country"]
}
// main.js
import { Actor } from "apify";
import { CheerioCrawler, ProxyConfiguration } from "crawlee";
await Actor.init();
const { startUrl, country, maxPages } = await Actor.getInput();
const proxyConfiguration = new ProxyConfiguration({
  // The country from the Actor input selects the Shifter IP pool.
  // The "http://" scheme is an assumption; confirm it in your Shifter dashboard.
  newUrlFunction: () => {
    const sid = Math.random().toString(36).slice(2, 10);
    return `http://customer-USERNAME-country-${country}-sid-${sid}-ttl-300:PASSWORD@p.shifter.io:443`;
  },
});
const crawler = new CheerioCrawler({
  proxyConfiguration,
  maxRequestsPerCrawl: maxPages,
  useSessionPool: true,
  async requestHandler({ request, $, enqueueLinks }) {
    await Actor.pushData({
      country,
      url: request.url,
      title: $("title").text().trim(),
      h1: $("h1").first().text().trim(),
    });
    await enqueueLinks({ strategy: "same-domain" });
  },
});
await crawler.run([startUrl]);
await Actor.exit();
Apify SDK without Crawlee crawlers (custom logic)
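If Crawlee's crawlers do not fit your use case, you can still get Shifter proxy URLs from ProxyConfiguration and use them with any HTTP client. Sessions, retries, and persistence continue to work.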
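When custom logic should retry a failed request through a different IP, one option is to ask for a new proxy URL before each attempt. The sketch below shows that pattern with stand-in functions and no network access; fetchWithFreshProxy, demoGetProxyUrl, and demoAttempt are hypothetical names introduced for illustration.

```javascript
// Hypothetical helper: runs `attempt(proxyUrl)` up to `maxAttempts` times,
// asking `getProxyUrl()` for a new proxy URL before every try.
async function fetchWithFreshProxy(getProxyUrl, attempt, maxAttempts = 3) {
  if (maxAttempts < 1) {
    throw new RangeError("maxAttempts must be at least 1");
  }
  let lastError;
  for (let i = 1; i <= maxAttempts; i += 1) {
    const proxyUrl = await getProxyUrl();
    try {
      return await attempt(proxyUrl);
    } catch (err) {
      lastError = err; // remember the failure and retry with another URL
    }
  }
  throw lastError;
}

// Demo with stand-ins: the first proxy "fails", the second succeeds.
let demoCounter = 0;
const demoGetProxyUrl = async () => `http://proxy-${++demoCounter}.invalid:443`;
const demoAttempt = async (proxyUrl) => {
  if (proxyUrl.includes("proxy-1.")) throw new Error("blocked");
  return `ok via ${proxyUrl}`;
};
fetchWithFreshProxy(demoGetProxyUrl, demoAttempt).then(console.log);
// ok via http://proxy-2.invalid:443
```

In the Actor from this section, getProxyUrl could be () => proxyConfiguration.newUrl() and attempt could wrap the gotScraping call. HTTP clients may apply their own retry policy on top of this, so it is worth checking the two do not multiply.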
import { Actor } from "apify";
import { ProxyConfiguration } from "crawlee";
import { gotScraping } from "got-scraping";
await Actor.init();
const proxyConfiguration = new ProxyConfiguration({
  // The "http://" scheme is an assumption; confirm it in your Shifter dashboard.
  newUrlFunction: () => {
    const sid = Math.random().toString(36).slice(2, 10);
    return `http://customer-USERNAME-country-fr-sid-${sid}:PASSWORD@p.shifter.io:443`;
  },
});
// Pull a fresh proxy URL per logical task
async function fetchTarget(url) {
  const proxyUrl = await proxyConfiguration.newUrl();
  const html = await gotScraping({
    url,
    proxyUrl,
    headerGeneratorOptions: {
      browsers: [{ name: "chrome", minVersion: 120 }],
      locales: ["en-US"],
    },
  }).text();
  return html;
}
const urls = [
  "https://example.fr/api/v1/products?page=1",
  "https://example.fr/api/v1/products?page=2",
  // ...
];
for (const url of urls) {
  try {
    const html = await fetchTarget(url);
    await Actor.pushData({ url, length: html.length });
  } catch (err) {
    // Log the failure and continue with the next URL.
    console.error(`Failed ${url}: ${err.message}`);
  }
}
await Actor.exit();
FAQ
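Frequently asked questions about using Shifter with Apify.
Use Crawlee's ProxyConfiguration class and pass either a `proxyUrls` array (one or more Shifter URLs) or a `newUrlFunction` that returns a new Shifter URL for each session. Pass the configuration to the crawler, and every request is routed through Shifter automatically.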
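For the `proxyUrls` route, a fixed pool of sticky URLs can be generated up front and handed to ProxyConfiguration, which rotates through the array. This is a minimal sketch rather than an official Shifter recipe: buildStickyUrls is a hypothetical helper, the "http://" scheme and the "pool1", "pool2" sid values are assumptions, and the username layout is copied from the URLs in this guide.

```javascript
// Hypothetical helper: builds `count` Shifter URLs, each with its own sid,
// suitable for the `proxyUrls` option of ProxyConfiguration.
function buildStickyUrls(count, { username, password, country }) {
  const urls = [];
  for (let i = 1; i <= count; i += 1) {
    const user = `customer-${username}-country-${country}-sid-pool${i}`;
    // Percent-encode both credentials so characters like "@" or ":" stay valid in a URL.
    urls.push(
      `http://${encodeURIComponent(user)}:${encodeURIComponent(password)}@p.shifter.io:443`,
    );
  }
  return urls;
}

console.log(buildStickyUrls(2, { username: "USERNAME", password: "PASSWORD", country: "us" }));
// [
//   'http://customer-USERNAME-country-us-sid-pool1:PASSWORD@p.shifter.io:443',
//   'http://customer-USERNAME-country-us-sid-pool2:PASSWORD@p.shifter.io:443'
// ]
```

The result could then be passed as new ProxyConfiguration({ proxyUrls: buildStickyUrls(5, credentials) }), where credentials is a placeholder for your own account details. A fixed pool reuses the same sid values for the whole run, whereas `newUrlFunction` can mint new ones on demand.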
Get started using Shifter with Apify
Run Apify Actors through Shifter's 205M+ residential and ISP proxies, with native Crawlee ProxyConfiguration support, session-level sticky IPs, and full support for Cheerio, Puppeteer, and Playwright crawlers.