HTML to JSON: Convert Any Web Page to Structured Data
Almost every scraping job ends the same way. You have HTML. You want JSON. The part in the middle is where the time goes.
For two decades the answer was a parser plus selectors. Fetch the page, walk the DOM, pluck out the text, coerce the types, hope nothing breaks. In 2026 there is a second answer: hand the page to a language model with a JSON schema and let it fill in the structure.
This guide covers both approaches honestly. When the parser path is the right call, when the LLM path saves you a week of work, and how to design schemas that produce useful JSON on the first try.
The two ways to turn HTML into JSON
The traditional pipeline looks like this.
- Fetch the page with an HTTP client or a headless browser.
- Parse the HTML into a DOM tree.
- Run selectors against the tree to pull out the fields you want.
- Coerce strings into numbers, dates, booleans.
- Serialize the result as JSON.
Every step has a library. Cheerio in Node, BeautifulSoup in Python, jsoup on the JVM. They are mature, fast, and free.
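For reference, here is a minimal sketch of that pipeline in Python with requests and BeautifulSoup. The URL and the selectors are made up; real ones depend entirely on the target site's markup.

import json
import requests
from bs4 import BeautifulSoup

# Hypothetical selectors for an example product page.
resp = requests.get("https://example.com/product/sku-123", timeout=30)
soup = BeautifulSoup(resp.text, "html.parser")

title = soup.select_one("div.product-title h1")
price = soup.select_one("span.price")

record = {
    "title": title.get_text(strip=True) if title else None,
    # Coerce a price string like "$1,295.00" into a number by hand.
    "price": float(price.get_text(strip=True).lstrip("$").replace(",", "")) if price else None,
}

print(json.dumps(record, indent=2))

Every one of those lines is a bet on the current markup, which is exactly the maintenance problem discussed below.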
The LLM pipeline collapses steps two through five.
- Fetch the page (or hand a URL to a hosted service that fetches it for you).
- Send the cleaned HTML and a JSON schema to a language model.
- Receive JSON shaped exactly like the schema.
The model handles the parsing, the field lookup, and the type coercion. You skip selectors entirely. The main AI scraping guide walks through this shift in more detail.
Why selectors break
A CSS selector is a hardcoded address. div.product-title h1 is a bet that the team running the site will keep that class name and that nesting forever. Three things make that bet bad.
- Build tools generate class names like css-1xj0pq that rotate on every deploy.
- Frontend frameworks render content after page load, so the HTML you fetch is not the HTML the user sees.
- A/B tests serve different layouts to different visitors.
The result is a maintenance tax. A scraper that worked yesterday returns empty strings today, and nobody notices until a downstream report is wrong.
LLM-based extraction is not immune to this, but it fails differently. When a page changes, the model still reads the new layout and usually still finds the field. You trade a brittle exact-match system for a fuzzy semantic one.
A minimum working example
Here is the smallest useful request against the CrawlAI API. It pulls a product title, price, and stock status from any product page.
curl -X POST https://crawlai.io/api/scrape/$CRAWLAI_TOKEN \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/product/sku-123",
"selector": "body",
"jsonSchema": {
"type": "object",
"properties": {
"title": { "type": "string", "description": "Product name shown on the page" },
"price": { "type": "number", "description": "Numeric price in the page currency" },
"currency": { "type": "string", "description": "ISO currency code, for example USD or EUR" },
"inStock": { "type": "boolean", "description": "Whether the page indicates the product is available" },
"sku": { "type": "string", "description": "Stock keeping unit if shown on the page" }
},
"required": ["title"]
}
}'
The response wraps the structured JSON inside the standard envelope.
{
"success": true,
"data": {
"title": "Mid-century lounge chair",
"finalUrl": "https://example.com/product/sku-123",
"statusCode": 200,
"metaDescription": "Walnut frame, wool upholstery, ships in 2 weeks.",
"content": "Mid-century lounge chair. Walnut frame...",
"aiAnalysis": {
"title": "Mid-century lounge chair",
"price": 1295,
"currency": "USD",
"inStock": true,
"sku": "LC-WAL-2025"
}
},
"remaining_calls": 998
}
You read data.aiAnalysis and you have your record. The rest of the envelope (title, metaDescription, finalUrl, statusCode) is useful for deduping and error handling. Full reference is in the docs.
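In Python, the same request and the unwrap step might look like this. This is a sketch against the endpoint shown above, with error handling kept to the basics.

import os
import requests

token = os.environ["CRAWLAI_TOKEN"]

payload = {
    "url": "https://example.com/product/sku-123",
    "selector": "body",
    "jsonSchema": {
        "type": "object",
        "properties": {
            "title": {"type": "string", "description": "Product name shown on the page"},
            "price": {"type": "number", "description": "Numeric price in the page currency"},
        },
        "required": ["title"],
    },
}

resp = requests.post(f"https://crawlai.io/api/scrape/{token}", json=payload, timeout=120)
body = resp.json()

if not body.get("success"):
    raise RuntimeError(f"Extraction failed with status {resp.status_code}")

record = body["data"]["aiAnalysis"]   # the structured fields from your schema
final_url = body["data"]["finalUrl"]  # useful for deduping redirected URLs
print(record)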
Schema design tips that matter
The schema is the contract. A bad schema produces a bad result even with a perfect model. A few patterns that pay off in practice.
Describe every field
A field named name with no description gives the model nothing to anchor on. Compare:
{ "name": { "type": "string" } }
versus
{ "name": { "type": "string", "description": "Full product name as shown on the page, excluding model number" } }
The second version produces dramatically more consistent results. Write descriptions as if you were briefing a contractor who has never seen the page.
Use enums for known categories
If a field has a small set of valid values, declare them. The model will pick from the list instead of inventing variants.
{
"condition": {
"type": "string",
"enum": ["new", "refurbished", "used", "unknown"],
"description": "Condition of the listed item"
}
}
This eliminates entire classes of normalisation work on your end.
Lists need explicit items
When you want an array, give the model a clear items definition.
{
"authors": {
"type": "array",
"items": { "type": "string" },
"description": "Author names from the byline, one per array entry"
}
}
Without items, the model has to guess. With it, the output is predictable.
Flatten when you can
Deeply nested schemas are harder for the model to fill in correctly. A two-level structure with five fields each is fine. A four-level structure with twenty fields each is asking for trouble. If you can flatten without losing meaning, flatten.
Handling the awkward cases
Real pages are messy. A few patterns for the edges.
Missing fields
When a field is not on the page, the model returns null or omits it. Decide in advance whether your downstream system treats that as an error, a retry signal, or a normal partial record. Mark only the fields you truly need as required in your schema.
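One way to make that decision explicit is a small completeness check, sketched here with hypothetical field sets; adjust them to your own schema.

REQUIRED = {"title"}            # reject or retry the record if these are missing
OPTIONAL = {"sku", "currency"}  # accept None for these and store a partial record

def check_completeness(record: dict) -> bool:
    # A missing or null required field is a signal, not a crash.
    missing = [f for f in REQUIRED if record.get(f) in (None, "")]
    return not missing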
Wrong types
The model can return a price as "19.99" instead of 19.99. The schema reduces this but does not eliminate it. Coerce types defensively in your own code before writing to a database.
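A defensive coercion step might look like this sketch, which accepts the common shapes a price can arrive in and returns None rather than raising.

def coerce_price(value):
    # Accept 19.99, "19.99", or "$1,295.00" and return a float or None.
    if value is None:
        return None
    if isinstance(value, (int, float)):
        return float(value)
    cleaned = str(value).strip().lstrip("$€£").replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None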
Multiple candidates
A page might list the same field three different ways (header, breadcrumb, structured data). The model picks one. If you care which one, narrow the selector field to point at the canonical region. For example, selector: "main" cuts out the header and footer entirely.
Hallucination
The model can invent data that is not on the page, especially when the schema asks for a field that is genuinely absent. The mitigations are a tight selector, an explicit description that says "leave null if not present", and validation on your end. Treat the response as untrusted until checked. The GPT-5 extraction tutorial covers validation patterns in depth.
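A cheap first check is to confirm that extracted string values actually appear in the page text the envelope returns in the content field. It is not bulletproof (the model may paraphrase, and content can be truncated), so treat a failed check as "needs review" rather than "wrong". A sketch:

def looks_grounded(record: dict, page_text: str) -> bool:
    # Spot-check string fields against the page content the API returned.
    for field in ("title", "sku"):
        value = record.get(field)
        if value and str(value).lower() not in page_text.lower():
            return False
    return True

# Usage: looks_grounded(body["data"]["aiAnalysis"], body["data"]["content"])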
Selector libraries vs LLM extraction
A short comparison of when each path is the right call.
| Concern | Selector libraries (cheerio, BeautifulSoup) | LLM extraction (CrawlAI) |
|---|---|---|
| Per-call cost | Tiny | Higher, includes model inference |
| Setup time | Hours to days per site | Minutes per schema |
| Maintenance when layout changes | Manual selector updates | Usually still works |
| Handles many different sites | Painful, one parser per site | Same code, swap the URL |
| Output predictability | Exact | Mostly consistent, validate output |
| JavaScript rendering | Need a headless browser | Built in |
| Best for | One site, very high volume | Many sites, moderate volume |
If you are pulling a million pages a day from a single site whose markup you control, write selectors. If you are pulling from a hundred different sites at modest volume, the schema-driven path is faster to ship and cheaper to maintain.
A practical workflow
Most teams converting HTML to JSON in 2026 end up with a hybrid.
- Maintain the URL list yourself (sitemaps, search results, partner feeds). CrawlAI does not crawl, so URL discovery is your code's job. The Firecrawl alternative page covers when you might want a full crawler instead.
- Call the extraction API per URL with a schema tailored to the page type.
- Validate the returned JSON. Type checks, bounds checks, null handling.
- Persist. Move on.
This pattern is the same whether you have ten URLs or ten million. The API is stateless and per-URL, so scaling is mostly a question of how many workers you run. For converting that same content into plain markdown rather than structured records, the URL to markdown guide covers the other side of the same coin.
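A minimal sketch of that loop with a thread pool, assuming an extract_one helper that wraps the POST request shown earlier and the validation helpers from the previous section:

from concurrent.futures import ThreadPoolExecutor, as_completed

def process(url: str):
    body = extract_one(url)             # hypothetical wrapper around the API call above
    if not body.get("success"):
        return None
    record = body["data"]["aiAnalysis"]
    if not check_completeness(record):  # validation before persistence
        return None
    record["price"] = coerce_price(record.get("price"))
    record["source_url"] = body["data"]["finalUrl"]
    return record

urls = ["https://example.com/product/sku-123"]  # your own URL list: sitemaps, feeds, search results
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = {pool.submit(process, u): u for u in urls}
    for future in as_completed(futures):
        record = future.result()
        if record:
            ...  # persist to your store

Throughput is then just a question of the worker count and your plan's rate limits.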
Where to go from here
If your job is structured extraction (lead enrichment, competitor monitoring, classification, content cataloguing), the schema-driven path is almost always the right starting point. You write a schema once, you reuse it across thousands of pages, and the maintenance burden stays close to zero.
If you want a deeper walkthrough of writing schemas for different page types (articles, products, contact pages, job listings), the GPT-5 extraction tutorial is the next read. For the broader strategic picture, the main AI scraping guide covers when this approach earns its keep and when it does not.
To see the full API contract, including every field on the response envelope and language examples in cURL, JavaScript, Python, and PHP, head to the documentation.
Try CrawlAI
Turn any URL into structured JSON with your own schema, powered by GPT-5. Pay-as-you-go starts at $10.