AI Web Scraping with One URL: The Practical Guide

For ten years the recipe for web scraping was the same. Fetch HTML. Write a CSS selector. Hope the site does not change. Repeat for every new template.

LLM web scraping flips the model. You describe the data you want as a JSON schema, hand the schema and the URL to an API, and an LLM reads the page and fills in the structure. No selectors. No regex. No per-site code.

This guide covers what that shift looks like in practice, where it works, where it does not, and how to build a small pipeline on top of a single-URL API like CrawlAI.

The selector era and why it ages badly

A CSS selector is a hardcoded address for a piece of HTML. div.product-title h1 works until the team behind the site renames the class to product__title or wraps the heading in another tag. The selector breaks silently. Your pipeline starts emitting empty strings or missing fields, and you only notice when a downstream system complains.
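To make that failure mode concrete, here is a minimal stdlib-only sketch. The regex stands in for a real selector engine (BeautifulSoup, lxml), and the markup is invented, but the behavior is exactly the one described: a class rename produces no error, only missing data.

```python
import re

def select_title(html):
    # Emulates the selector "div.product-title h1": find the div with
    # class "product-title", then the first <h1> inside it.
    div = re.search(r'<div class="product-title">(.*?)</div>', html, re.S)
    if not div:
        return None  # the "silent break": no exception, just no data
    h1 = re.search(r"<h1>(.*?)</h1>", div.group(1), re.S)
    return h1.group(1).strip() if h1 else None

old_markup = '<div class="product-title"><h1>Blue Widget</h1></div>'
new_markup = '<div class="product__title"><h1>Blue Widget</h1></div>'

print(select_title(old_markup))  # Blue Widget
print(select_title(new_markup))  # None -- the class rename broke extraction
```

Nothing in the second call signals a problem; the pipeline keeps running and emits nulls until someone downstream notices.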

Three patterns make this worse:

  - Content rendered by JavaScript after page load, so the selector's target is not even in the raw HTML you fetched.
  - A/B tests and regional variants that serve different markup to different visitors, so the selector works on your machine and fails in production.
  - Redesigns that rename classes and restructure templates without notice.

The result is a maintenance tax. Every team running a scraper at scale ends up with a graveyard of half-broken selectors and a Slack channel of "the X scraper is down again" messages.

What LLM web scraping actually does

The AI approach swaps the address for a description. Instead of "give me the text inside div.product-title h1", you say "give me a string called title describing the product name".

A pipeline like CrawlAI then does four things:

  1. Loads the page in a headless browser so JavaScript-rendered content is available.
  2. Cleans the HTML down to the region you point it at (body by default, or a tighter CSS selector if you know one).
  3. Calls a language model (GPT-5 in this case) with the cleaned text and your JSON schema.
  4. Returns the JSON that matches the schema, alongside the raw page metadata.

The model does the semantic lookup. The schema enforces the shape. You read the JSON like any other API response.

A working example

Here is the minimum useful request. It pulls a product title, price, and stock status off a generic product page.

curl -X POST https://crawlai.io/api/scrape/$CRAWLAI_TOKEN \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/product/123",
    "selector": "body",
    "jsonSchema": {
      "type": "object",
      "properties": {
        "title":    { "type": "string", "description": "Product name as shown on the page" },
        "price":    { "type": "number", "description": "Numeric price in the page currency" },
        "currency": { "type": "string", "description": "ISO currency code, e.g. USD or EUR" },
        "inStock":  { "type": "boolean", "description": "Whether the page indicates the product is in stock" }
      }
    }
  }'

The response contains an aiAnalysis object shaped exactly like the schema. No parsing on your end. If a field is not present on the page, the model leaves it empty or null. The same request shape works for a blog post if you swap the schema, or a job listing, or a news article. That portability is the point.
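Consuming the result is ordinary JSON handling. The sketch below parses a sample response: the aiAnalysis field mirrors the schema from the request above, while the rest of the envelope is an assumption for illustration. Defensive .get() calls cover the empty-or-null case for fields the page did not have.

```python
import json

# Illustrative response body; "aiAnalysis" is shaped like the request schema.
# The envelope around it is an assumption for this sketch.
raw = '''
{
  "url": "https://example.com/product/123",
  "aiAnalysis": {
    "title": "Blue Widget",
    "price": 19.99,
    "currency": "USD",
    "inStock": true
  }
}
'''

data = json.loads(raw)["aiAnalysis"]
# Fields absent from the page come back empty or null, so read defensively.
title = data.get("title")
price = data.get("price")
print(f"{title}: {price} {data.get('currency', '?')}")  # Blue Widget: 19.99 USD
```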

When schema-driven extraction is the right tool

Schema extraction earns its keep when:

  - You need the same fields from many differently structured sites, and one schema replaces per-site code.
  - Target sites change their markup often enough that selectors keep breaking.
  - The fields are semantic: a human could point at the title, price, or author without reading the HTML.

It is less useful when:

  - You are scraping one stable, high-volume template you already have working selectors for; an LLM call per page costs more than a CSS lookup.
  - You need verbatim extraction of long text, where a model may truncate or paraphrase.
  - You want to crawl an entire site from a single request.

This last point is worth repeating. CrawlAI is a single-URL API. One URL in, one structured JSON out. There is no "crawl my whole site" endpoint. Building that on top is straightforward (your code holds the queue and calls the API per URL), but the API itself does not pretend to handle it.

Single-URL pipelines that scale

Most scraping use cases break down into the same shape:

  1. Maintain a list of URLs to process (sitemap, search results, your own seed list).
  2. For each URL, call the scrape API.
  3. Validate the returned JSON against your schema in your own code.
  4. Persist the result.

A 50-line script gets you a working pipeline. Queue, retry, persist. Because the API is stateless and per-URL, scaling is mostly a question of how many workers you want to run.
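As a sketch of that shape, the loop below wires queue, retry, and persist together. The fetch function is injected and faked so the example runs without network access; in a real pipeline it would POST to the scrape endpoint as in the curl example above, and the validation step would check the full schema rather than one field.

```python
import time

def scrape(url, fetch, retries=3, backoff=1.0):
    """Call the single-URL API with simple exponential-backoff retries.

    `fetch` is injected so the sketch stays testable; in production it
    would POST the URL and schema to the scrape endpoint.
    """
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(backoff * 2 ** attempt)

def run_pipeline(urls, fetch, store):
    # Queue -> scrape -> validate -> persist, one URL at a time.
    for url in urls:
        result = scrape(url, fetch)
        if not isinstance(result.get("title"), str):  # minimal validation
            continue
        store[url] = result

# Stand-in fetcher so the demo runs offline.
def fake_fetch(url):
    return {"title": f"Product at {url}", "price": 9.99}

store = {}
run_pipeline(["https://example.com/p/1", "https://example.com/p/2"],
             fake_fetch, store)
print(len(store))  # 2
```

Because each call is independent, turning this into a multi-worker version is a matter of sharding the URL list; no shared crawl state is needed.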

A few patterns help in practice:

  - Keep the schema small and the descriptions specific; extra fields cost tokens and invite noise.
  - Retry transient failures with backoff, and cap concurrency so you do not hammer the target site.
  - Store the raw response alongside the extracted JSON so you can re-validate later without re-scraping.

How CrawlAI compares to other tools

A few tools in this space worth knowing about:

  - Crawl4AI: open source and self-hosted. You run the browser and the model calls yourself, trading convenience for control.
  - Classic frameworks like Scrapy or BeautifulSoup: cheapest per page by far, but you are back to selectors and per-site code.
  - CrawlAI: hosted and single-URL. Schema in, JSON out, no infrastructure to run, no built-in crawling.

Honest take: none of these is universally best. Pick the one whose default scope matches your problem.

Common pitfalls

A few traps to watch out for when moving a pipeline from selectors to schema extraction:

  - Vague field descriptions produce inconsistent output. "price" alone is ambiguous; "Numeric price in the page currency" is not.
  - Sending the whole body when a tighter selector would do inflates the text the model has to read. Scope the region when you can.
  - Fields missing from the page come back empty or null. Validate the returned JSON in your own code before persisting it.

Where to go next

If you want a hands-on walkthrough of building schemas for different page types, the extraction tutorial covers article, product, and contact-info schemas end to end.

If you are comparing CrawlAI to an alternative, the Crawl4AI comparison covers the hosted-vs-self-hosted tradeoff in detail.

To see the full API contract, the documentation lists every field, every error code, and language examples in cURL, JavaScript, Python, and PHP.

Try CrawlAI

Turn any URL into structured JSON with your own schema, powered by GPT-5. Pay-as-you-go starts at $10.