AI Web Scraping with One URL: The Practical Guide
For ten years the recipe for web scraping was the same. Fetch HTML. Write a CSS selector. Hope the site does not change. Repeat for every new template.
LLM web scraping flips the model. You describe the data you want as a JSON schema, hand the schema and the URL to an API, and an LLM reads the page and fills in the structure. No selectors. No regex. No per-site code.
This guide covers what that shift looks like in practice, where it works, where it does not, and how to build a small pipeline on top of a single-URL API like CrawlAI.
The selector era and why it ages badly
A CSS selector is a hardcoded address for a piece of HTML. div.product-title h1 works until the team behind the site renames the class to product__title or wraps the heading in another tag. The selector breaks silently. Your pipeline starts emitting empty strings or missing fields, and you only notice when a downstream system complains.
Three patterns make this worse:
- Modern marketing sites use class names generated by build tools (css-1xj0pq) that change on every deploy.
- JavaScript frameworks render content client side. The HTML you fetch is not the HTML a browser shows the user.
- Sites are A/B testing layouts more aggressively. The same URL can return different markup to different visitors.
The result is a maintenance tax. Every team running a scraper at scale ends up with a graveyard of half-broken selectors and a Slack channel of "the X scraper is down again" messages.
What LLM web scraping actually does
The AI approach swaps the address for a description. Instead of "give me the text inside div.product-title h1", you say "give me a string called title describing the product name".
A pipeline like CrawlAI then does four things:
- Loads the page in a headless browser so JavaScript-rendered content is available.
- Cleans the HTML down to the region you point it at (body by default, or a tighter CSS selector if you know one).
- Calls a language model (GPT-5 in this case) with the cleaned text and your JSON schema.
- Returns the JSON that matches the schema, alongside the raw page metadata.
The model does the semantic lookup. The schema enforces the shape. You read the JSON like any other API response.
A working example
Here is the minimum useful request. It pulls a product title, price, and stock status off a generic product page.
curl -X POST https://crawlai.io/api/scrape/$CRAWLAI_TOKEN \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/product/123",
    "selector": "body",
    "jsonSchema": {
      "type": "object",
      "properties": {
        "title": { "type": "string", "description": "Product name as shown on the page" },
        "price": { "type": "number", "description": "Numeric price in the page currency" },
        "currency": { "type": "string", "description": "ISO currency code, e.g. USD or EUR" },
        "inStock": { "type": "boolean", "description": "Whether the page indicates the product is in stock" }
      }
    }
  }'
The response contains an aiAnalysis object shaped exactly like the schema. No parsing on your end. If a field is not present on the page, the model leaves it empty or null. The same request shape works for a blog post if you swap the schema, or a job listing, or a news article. That portability is the point.
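For the schema above, a response looks roughly like the following. The values are made up, and the exact placement of the metadata fields next to aiAnalysis is an assumption on our part; the documentation has the authoritative shape.
{
  "aiAnalysis": {
    "title": "Acme Anvil 3000",
    "price": 149.0,
    "currency": "USD",
    "inStock": true
  },
  "title": "Acme Anvil 3000 | Example Shop",
  "metaDescription": "Product page for the Acme Anvil 3000.",
  "finalUrl": "https://example.com/product/123",
  "statusCode": 200
}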
When schema-driven extraction is the right tool
Schema extraction earns its keep when:
- You are processing pages from many different sources and cannot write a parser per source.
- You care about meaning more than structure (find the contact email, regardless of where it lives on the page).
- The pages are recent enough that a general-purpose model has seen the patterns before.
- Volume is modest. Hundreds to low millions of pages, not billions per day.
It is less useful when:
- You scrape one site at very high volume. Per-call AI cost adds up, and hand-tuned selectors win on price.
- You need pixel-perfect, structurally identical output across runs. LLMs vary slightly between calls.
- You need to traverse a site. CrawlAI extracts from one URL. If you want to crawl an entire domain, you need a different tool or a small loop in your own code.
This last point is worth repeating. CrawlAI is a single-URL API. One URL in, one structured JSON out. There is no "crawl my whole site" endpoint. Building that on top is straightforward (your code holds the queue and calls the API per URL), but the API itself does not pretend to handle it.
Single-URL pipelines that scale
Most scraping use cases break down into the same shape:
- Maintain a list of URLs to process (sitemap, search results, your own seed list).
- For each URL, call the scrape API.
- Validate the returned JSON against your schema in your own code.
- Persist the result.
A 50-line script gets you a working pipeline. Queue, retry, persist. Because the API is stateless and per-URL, scaling is mostly a question of how many workers you want to run.
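As a concrete sketch of that shape, here is a minimal Python worker. It assumes the requests library, the CRAWLAI_TOKEN environment variable and endpoint from the example above, and a JSON Lines file as the persistence layer; the retry policy and trimmed schema are illustrative, not prescriptive.

import json
import os
import time

import requests

# Endpoint and token shape taken from the curl example above.
API_URL = f"https://crawlai.io/api/scrape/{os.environ['CRAWLAI_TOKEN']}"

# Any schema works here; this reuses a trimmed version of the product schema.
SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string", "description": "Product name as shown on the page"},
        "price": {"type": "number", "description": "Numeric price in the page currency"},
    },
}


def scrape(url, retries=3):
    # Call the API for one URL, retrying transient failures with backoff.
    for attempt in range(retries):
        try:
            resp = requests.post(
                API_URL,
                json={"url": url, "selector": "body", "jsonSchema": SCHEMA},
                timeout=120,
            )
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)


def run(urls, out_path="results.jsonl"):
    # The queue is just a list of URLs; persistence is one JSON line per page.
    with open(out_path, "a", encoding="utf-8") as out:
        for url in urls:
            result = scrape(url)
            out.write(json.dumps({"url": url, "result": result}) + "\n")


if __name__ == "__main__":
    run(["https://example.com/product/123"])

Scaling out is then a matter of running several copies of this worker over partitions of the URL list.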
A few patterns help in practice:
- Cache aggressively. Hash the URL plus the schema and skip the call if you already have a fresh result.
- Narrow the selector when the page has obvious noise (sidebars, ad blocks, related-articles strips). A tighter selector means less text in the prompt, which means faster and cheaper calls.
- Validate. The response is structured but the model can still get a field wrong. Run your own type checks and bounds checks before writing to a database.
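That validation step does not need a framework. A minimal sketch in Python, using the field names from the product schema above (the bounds and error handling are arbitrary examples, not requirements):

def validate_product(ai_analysis):
    # Coerce and sanity-check extracted fields before writing them anywhere.
    price = ai_analysis.get("price")
    if isinstance(price, str):
        # Models occasionally return numbers as strings, e.g. "19.99".
        price = float(price.replace(",", "").strip())
    if price is not None and not (0 < price < 1_000_000):
        raise ValueError(f"price out of bounds: {price}")

    title = ai_analysis.get("title")
    if not title or not isinstance(title, str):
        raise ValueError("missing or non-string title")

    return {**ai_analysis, "price": price, "title": title.strip()}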
How CrawlAI compares to other tools
A few tools in this space are worth knowing about:
- Firecrawl crawls entire sites and returns markdown or JSON. If you need to discover and process every URL on a domain, that is what it is built for. CrawlAI is narrower: single URL, structured output, less to learn. See the head-to-head comparison for the full breakdown.
- Crawl4AI is a popular open-source Python library. If you are comfortable hosting your own scrapers and using your own OpenAI key, it is excellent. CrawlAI is the hosted equivalent, with anti-bot and rendering handled for you. The Crawl4AI vs CrawlAI post covers the hosted-vs-self-hosted tradeoff in detail.
- Diffbot offers pre-built extractors for common page types (article, product, organization). That works well if the page fits the templates. CrawlAI uses your own JSON schema instead, so the shape of the output is up to you. More on this in the Diffbot alternative post.
Honest take: none of these is universally best. Pick the one whose default scope matches your problem.
Common pitfalls
A few traps to watch out for when moving a pipeline from selectors to schema extraction:
- Vague field descriptions lead to vague results. "name" is worse than "Full product name as shown on the page, excluding model number". Write descriptions as if you are briefing a human contractor.
- Schemas that are too deeply nested are harder for the model. If you can flatten a structure, do.
- Trusting the output blindly. Even with a strict schema, a price field can come back as the string "19.99" instead of a number. Coerce types defensively.
- Ignoring the page metadata. The response also includes title, metaDescription, finalUrl, and statusCode. These are useful for deduping and error handling, and you get them for free.
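A small sketch of how that metadata helps, assuming finalUrl and statusCode sit at the top level of the response next to aiAnalysis (worth confirming against the docs):

seen = set()

def should_keep(result):
    # Drop error pages and duplicates using the page metadata.
    # Assumes finalUrl and statusCode are top-level fields in the response.
    if result.get("statusCode") != 200:
        return False  # the fetch failed or landed on an error page
    final_url = result.get("finalUrl")
    if final_url in seen:
        return False  # the same canonical page reached via another URL
    seen.add(final_url)
    return True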
Where to go next
If you want a hands-on walkthrough of building schemas for different page types, the extraction tutorial covers article, product, and contact-info schemas end to end.
If you are comparing CrawlAI to an alternative, the Crawl4AI comparison covers the hosted-vs-self-hosted tradeoff in detail.
To see the full API contract, the documentation lists every field, every error code, and language examples in cURL, JavaScript, Python, and PHP.
Try CrawlAI
Turn any URL into structured JSON with your own schema, powered by GPT-5. Pay-as-you-go starts at $10.