Extraction rules

Skip BeautifulSoup, Cheerio, and regex. Tell Shifter which CSS selectors map to which JSON fields, and get clean JSON back.

Pass extract_rules as a URL-encoded JSON object where each key is an output field and each value describes how to extract it.

Minimal example:

{
  "title": { "selector": "h1", "output": "text" }
}

URL-encoded and sent:

curl "https://scrape.shifter.io/v1?api_key=YOUR_API_KEY&url=https://example.com&extract_rules=%7B%22title%22%3A%7B%22selector%22%3A%22h1%22%2C%22output%22%3A%22text%22%7D%7D"
# {"title": "Example Domain"}
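
The encoding step can also be sketched in Python with only the standard library. This is a minimal sketch rather than an official client: the helper name build_request_url is hypothetical, and the endpoint and parameter names are simply the ones used in the curl example above.

```python
import json
from urllib.parse import urlencode

# Endpoint taken from the curl example above.
API_ENDPOINT = "https://scrape.shifter.io/v1"


def build_request_url(api_key: str, target_url: str, rules: dict) -> str:
    """Return the full request URL with extract_rules URL-encoded.

    Hypothetical helper for illustration; not part of any official client.
    """
    params = {
        "api_key": api_key,
        "url": target_url,
        # Compact separators keep the encoded rules short.
        "extract_rules": json.dumps(rules, separators=(",", ":")),
    }
    # urlencode percent-encodes every value, including the JSON rules.
    return f"{API_ENDPOINT}?{urlencode(params)}"


rules = {"title": {"selector": "h1", "output": "text"}}
request_url = build_request_url("YOUR_API_KEY", "https://example.com", rules)
print(request_url)
```

The extract_rules portion of the printed URL matches the encoded string in the curl example. The target url is percent-encoded here as well, which a standard query-string parser decodes back to the original value.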

| output | Returns | Example |
| --- | --- | --- |
| text | Text content of the element | "Example Domain" |
| html | Inner HTML | "<strong>Example</strong> Domain" |
| @<attr> | Attribute value | "@href" returns "https://example.com/" |
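
To make the difference between the three output values concrete, here is a local approximation in Python. It only illustrates the semantics listed above on a well-formed snippet; it is not Shifter's implementation, and the function name apply_output is hypothetical. Details such as whitespace handling on the real service are not specified here.

```python
import xml.etree.ElementTree as ET


def apply_output(element: ET.Element, output: str):
    """Local approximation of the three output modes (assumption, not the real parser)."""
    if output == "text":
        # Concatenated text of the element and all of its descendants.
        return "".join(element.itertext())
    if output == "html":
        # Leading text plus each serialized child; tostring() includes the
        # text that follows the child (its "tail").
        return (element.text or "") + "".join(
            ET.tostring(child, encoding="unicode") for child in element
        )
    if output.startswith("@"):
        # "@href" -> value of the href attribute, or None if it is absent.
        return element.get(output[1:])
    raise ValueError(f"unsupported output: {output!r}")


heading = ET.fromstring("<h1><strong>Example</strong> Domain</h1>")
link = ET.fromstring('<a href="https://example.com/">Home</a>')

print(apply_output(heading, "text"))  # Example Domain
print(apply_output(heading, "html"))  # <strong>Example</strong> Domain
print(apply_output(link, "@href"))    # https://example.com/
```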

Add keys to the rules object:

{
  "title": { "selector": "h1", "output": "text" },
  "description": { "selector": "meta[name=description]", "output": "@content" },
  "canonical": { "selector": "link[rel=canonical]", "output": "@href" }
}

Response:

{
  "title": "Example Domain",
  "description": "The example domain...",
  "canonical": "https://example.com/"
}

For lists (search results, product cards, table rows), wrap the rule in a parent that specifies type: "list" and item:

{
  "products": {
    "selector": "div.product",
    "type": "list",
    "item": {
      "name": { "selector": "h2", "output": "text" },
      "price": { "selector": ".price", "output": "text" },
      "link": { "selector": "a.title", "output": "@href" }
    }
  }
}
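
Nested rules are easy to get slightly wrong, so it can help to check them locally before sending a request. The following is a minimal sketch of such a check, assuming only the rule shapes shown in this document (selector, output, type, item); the function validate_rules is hypothetical, and the real API may accept options that this check would reject.

```python
def validate_rules(rules: dict, path: str = "") -> list:
    """Return a list of problems found in an extract_rules object (empty if none).

    Checks only the shapes shown in this document; the real API may accept more.
    """
    problems = []
    for field, rule in rules.items():
        where = f"{path}{field}"
        if not isinstance(rule, dict) or not rule.get("selector"):
            problems.append(f"{where}: missing selector")
            continue
        if rule.get("type") == "list":
            item = rule.get("item")
            if not isinstance(item, dict) or not item:
                problems.append(f"{where}: list rule needs a non-empty item object")
            else:
                # Recurse into the per-element rules.
                problems.extend(validate_rules(item, path=f"{where}.item."))
        else:
            output = rule.get("output")
            valid = output in ("text", "html") or (
                isinstance(output, str) and output.startswith("@") and len(output) > 1
            )
            if not valid:
                problems.append(f"{where}: output must be text, html, or @<attr>")
    return problems


product_rules = {
    "products": {
        "selector": "div.product",
        "type": "list",
        "item": {
            "name": {"selector": "h2", "output": "text"},
            "price": {"selector": ".price", "output": "text"},
            "link": {"selector": "a.title", "output": "@href"},
        },
    }
}
print(validate_rules(product_rules))  # []

# A typo in a nested output value is reported with its full path.
bad_rules = {
    "products": {
        "selector": "div.product",
        "type": "list",
        "item": {"name": {"selector": "h2", "output": "txt"}},
    }
}
print(validate_rules(bad_rules))
# ['products.item.name: output must be text, html, or @<attr>']
```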

Response:

{
  "products": [
    { "name": "Item A", "price": "$19.99", "link": "/item-a" },
    { "name": "Item B", "price": "$24.50", "link": "/item-b" }
  ]
}
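
The values in a list response come back as the page presents them, so the sample above contains relative links and price strings. A client-side cleanup step might look like the sketch below. It assumes prices formatted like "$19.99" and uses a hypothetical page URL (https://example.com/shop) as the base for resolving links; normalize_products is an illustrative helper, not part of the API.

```python
from decimal import Decimal
from urllib.parse import urljoin


def normalize_products(result: dict, page_url: str) -> list:
    """Resolve relative links and parse dollar prices in a list response.

    Assumes the response shape shown above and prices such as "$19.99".
    """
    normalized = []
    for product in result.get("products") or []:
        price_text = product.get("price")
        link = product.get("link")
        normalized.append({
            "name": product.get("name"),
            # Strip the currency symbol and thousands separators before parsing.
            "price": Decimal(price_text.lstrip("$").replace(",", "")) if price_text else None,
            # Relative links are resolved against the scraped page's URL.
            "link": urljoin(page_url, link) if link else None,
        })
    return normalized


response = {
    "products": [
        {"name": "Item A", "price": "$19.99", "link": "/item-a"},
        {"name": "Item B", "price": "$24.50", "link": "/item-b"},
    ]
}
products = normalize_products(response, "https://example.com/shop")
print(products[0]["price"], products[0]["link"])
# 19.99 https://example.com/item-a
```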

For endpoints that already return JSON (most REST APIs), add auto_parser=1 to parse the body and return it as-is:

curl "https://scrape.shifter.io/v1?api_key=YOUR_API_KEY&url=https://api.example.com/products&auto_parser=1"

  • Test selectors in the browser dev console first, for example with document.querySelector(...).
  • URL-encode the extract_rules JSON. Most HTTP clients do this automatically when you pass the query parameters as a parameter object instead of building the URL string by hand.
  • If a field is missing on the page, its value is null; the request does not fail.
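
Because missing fields come back as null instead of raising an error, it is worth checking the response before using it. The sketch below lists every null field in a parsed response. The helper missing_fields is hypothetical, the sample response is made up for illustration, and whether null can also appear inside list items is an assumption.

```python
import json


def missing_fields(result: dict, prefix: str = "") -> list:
    """Return dotted paths of fields whose value is None (JSON null)."""
    missing = []
    for key, value in result.items():
        path = f"{prefix}{key}"
        if value is None:
            missing.append(path)
        elif isinstance(value, list):
            # Assumption: items of a list rule can contain null fields too.
            for index, item in enumerate(value):
                if isinstance(item, dict):
                    missing.extend(missing_fields(item, prefix=f"{path}[{index}]."))
    return missing


# Illustrative response body, not real API output.
body = (
    '{"title": "Example Domain", "description": null, '
    '"products": [{"name": "Item A", "price": null}]}'
)
result = json.loads(body)
print(missing_fields(result))
# ['description', 'products[0].price']
```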