HTML to JSON: Convert Any Web Page to Structured Data
Almost every scraping job ends the same way. You have HTML. You want JSON. The part in the middle is where the time goes.
For two decades the answer was a parser plus selectors. Fetch the page, walk the DOM, pluck out the text, coerce the types, hope nothing breaks. In 2026 there is a second answer: hand the page to a language model with a JSON schema and let it fill in the structure.
This guide covers both approaches honestly. When the parser path is the right call, when the LLM path saves you a week of work, and how to design schemas that produce useful JSON on the first try.
The two ways to turn HTML into JSON
The traditional pipeline looks like this.
- Fetch the page with an HTTP client or a headless browser.
- Parse the HTML into a DOM tree.
- Run selectors against the tree to pull out the fields you want.
- Coerce strings into numbers, dates, booleans.
- Serialize the result as JSON.
Every step has a library. Cheerio in Node, BeautifulSoup in Python, jsoup on the JVM. They are mature, fast, and free.
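For reference, here is a minimal sketch of that pipeline in Python with requests and BeautifulSoup. The URL and the selectors are made up; real ones depend entirely on the target site's markup.

import json
import requests
from bs4 import BeautifulSoup

# Hypothetical selectors for an example product page.
resp = requests.get("https://example.com/product/sku-123", timeout=30)
soup = BeautifulSoup(resp.text, "html.parser")

title = soup.select_one("div.product-title h1")
price = soup.select_one("span.price")

record = {
    "title": title.get_text(strip=True) if title else None,
    # Coerce a price string like "$1,295.00" into a number by hand.
    "price": float(price.get_text(strip=True).lstrip("$").replace(",", "")) if price else None,
}

print(json.dumps(record, indent=2))

Every one of those lines is a bet on the current markup, which is exactly the maintenance problem discussed below.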
The LLM pipeline collapses steps two through five.
- Fetch the page (or hand a URL to a hosted service that fetches it for you).
- Send the cleaned HTML and a JSON schema to a language model.
- Receive JSON shaped exactly like the schema.
The model handles the parsing, the field lookup, and the type coercion. You skip selectors entirely. The main AI scraping guide walks through this shift in more detail.
Why selectors break
A CSS selector is a hardcoded address. div.product-title h1 is a bet that the team running the site will keep that class name and that nesting forever. Three things make that bet bad.
- Build tools generate class names like css-1xj0pq that rotate on every deploy.
- Frontend frameworks render content after page load, so the HTML you fetch is not the HTML the user sees.
- A/B tests serve different layouts to different visitors.
The result is a maintenance tax. A scraper that worked yesterday returns empty strings today, and nobody notices until a downstream report is wrong.
LLM-based extraction is not immune to this, but it fails differently. When a page changes, the model still reads the new layout and usually still finds the field. You trade a brittle exact-match system for a fuzzy semantic one.
A minimum working example
Here is the smallest useful request against the CrawlAI API. It pulls a product title, price, and stock status from any product page.
curl -X POST https://crawlai.io/api/scrape/$CRAWLAI_TOKEN \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/product/sku-123",
"selector": "body",
"jsonSchema": {
"type": "object",
"properties": {
"title": { "type": "string", "description": "Product name shown on the page" },
"price": { "type": "number", "description": "Numeric price in the page currency" },
"currency": { "type": "string", "description": "ISO currency code, for example USD or EUR" },
"inStock": { "type": "boolean", "description": "Whether the page indicates the product is available" },
"sku": { "type": "string", "description": "Stock keeping unit if shown on the page" }
},
"required": ["title"]
}
}'
The response wraps the structured JSON inside the standard envelope.
{
"success": true,
"data": {
"title": "Mid-century lounge chair",
"finalUrl": "https://example.com/product/sku-123",
"statusCode": 200,
"metaDescription": "Walnut frame, wool upholstery, ships in 2 weeks.",
"content": "Mid-century lounge chair. Walnut frame...",
"aiAnalysis": {
"title": "Mid-century lounge chair",
"price": 1295,
"currency": "USD",
"inStock": true,
"sku": "LC-WAL-2025"
}
},
"remaining_calls": 998
}
You read data.aiAnalysis and you have your record. The rest of the envelope (title, metaDescription, finalUrl, statusCode) is useful for deduping and error handling. Full reference is in the docs.
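In Python, the same request and the unwrap step might look like this. This is a sketch against the endpoint shown above, with error handling kept to the basics.

import os
import requests

token = os.environ["CRAWLAI_TOKEN"]

payload = {
    "url": "https://example.com/product/sku-123",
    "selector": "body",
    "jsonSchema": {
        "type": "object",
        "properties": {
            "title": {"type": "string", "description": "Product name shown on the page"},
            "price": {"type": "number", "description": "Numeric price in the page currency"},
        },
        "required": ["title"],
    },
}

resp = requests.post(f"https://crawlai.io/api/scrape/{token}", json=payload, timeout=120)
body = resp.json()

if not body.get("success"):
    raise RuntimeError(f"Extraction failed with status {resp.status_code}")

record = body["data"]["aiAnalysis"]   # the structured fields from your schema
final_url = body["data"]["finalUrl"]  # useful for deduping redirected URLs
print(record)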
Schema design tips that matter
The schema is the contract. A bad schema produces a bad result even with a perfect model. A few patterns that pay off in practice.
Describe every field
A field named name with no description gives the model nothing to anchor on. Compare:
{ "name": { "type": "string" } }
versus
{ "name": { "type": "string", "description": "Full product name as shown on the page, excluding model number" } }
The second version produces dramatically more consistent results. Write descriptions as if you were briefing a contractor who has never seen the page.
Use enums for known categories
If a field has a small set of valid values, declare them. The model will pick from the list instead of inventing variants.
{
"condition": {
"type": "string",
"enum": ["new", "refurbished", "used", "unknown"],
"description": "Condition of the listed item"
}
}
This eliminates entire classes of normalisation work on your end.
Lists need explicit items
When you want an array, give the model a clear items definition.
{
"authors": {
"type": "array",
"items": { "type": "string" },
"description": "Author names from the byline, one per array entry"
}
}
Without items, the model has to guess. With it, the output is predictable.
Flatten when you can
Deeply nested schemas are harder for the model to fill in correctly. A two-level structure with five fields each is fine. A four-level structure with twenty fields each is asking for trouble. If you can flatten without losing meaning, flatten.
Handling the awkward cases
Real pages are messy. A few patterns for the edges.
Missing fields
When a field is not on the page, the model returns null or omits it. Decide in advance whether your downstream system treats that as an error, a retry signal, or a normal partial record. Mark only the fields you truly need as required in your schema.
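One way to make that decision explicit is a small completeness check, sketched here with hypothetical field sets; adjust them to your own schema.

REQUIRED = {"title"}            # reject or retry the record if these are missing
OPTIONAL = {"sku", "currency"}  # accept None for these and store a partial record

def check_completeness(record: dict) -> bool:
    # A missing or null required field is a signal, not a crash.
    missing = [f for f in REQUIRED if record.get(f) in (None, "")]
    return not missing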
Wrong types
The model can return a price as "19.99" instead of 19.99. The schema reduces this but does not eliminate it. Coerce types defensively in your own code before writing to a database.
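A defensive coercion step might look like this sketch, which accepts the common shapes a price can arrive in and returns None rather than raising.

def coerce_price(value):
    # Accept 19.99, "19.99", or "$1,295.00" and return a float or None.
    if value is None:
        return None
    if isinstance(value, (int, float)):
        return float(value)
    cleaned = str(value).strip().lstrip("$€£").replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None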
Multiple candidates
A page might list the same field three different ways (header, breadcrumb, structured data). The model picks one. If you care which one, narrow the selector field to point at the canonical region. For example, selector: "main" cuts out the header and footer entirely.
Hallucination
The model can invent data that is not on the page, especially when the schema asks for a field that is genuinely absent. The mitigations are a tight selector, an explicit description that says "leave null if not present", and validation on your end. Treat the response as untrusted until checked. The GPT-5 extraction tutorial covers validation patterns in depth.
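A cheap first check is to confirm that extracted string values actually appear in the page text the envelope returns in the content field. It is not bulletproof (the model may paraphrase, and content can be truncated), so treat a failed check as "needs review" rather than "wrong". A sketch:

def looks_grounded(record: dict, page_text: str) -> bool:
    # Spot-check string fields against the page content the API returned.
    for field in ("title", "sku"):
        value = record.get(field)
        if value and str(value).lower() not in page_text.lower():
            return False
    return True

# Usage: looks_grounded(body["data"]["aiAnalysis"], body["data"]["content"])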
Selector libraries vs LLM extraction
A short comparison of when each path is the right call.
| Concern | Selector libraries (cheerio, BeautifulSoup) | LLM extraction (CrawlAI) |
|---|---|---|
| Per-call cost | Tiny | Higher, includes model inference |
| Setup time | Hours to days per site | Minutes per schema |
| Maintenance when layout changes | Manual selector updates | Usually still works |
| Handles many different sites | Painful, one parser per site | Same code, swap the URL |
| Output predictability | Exact | Mostly consistent, validate output |
| JavaScript rendering | Need a headless browser | Built in |
| Best for | One site, very high volume | Many sites, moderate volume |
If you are pulling a million pages a day from a single site whose markup you control, write selectors. If you are pulling from a hundred different sites at modest volume, the schema-driven path is faster to ship and cheaper to maintain.
A practical workflow
Most teams converting HTML to JSON in 2026 end up with a hybrid.
- Maintain the URL list yourself (sitemaps, search results, partner feeds). CrawlAI does not crawl, so URL discovery is your code's job. The Firecrawl alternative page covers when you might want a full crawler instead.
- Call the extraction API per URL with a schema tailored to the page type.
- Validate the returned JSON. Type checks, bounds checks, null handling.
- Persist. Move on.
This pattern is the same whether you have ten URLs or ten million. The API is stateless and per-URL, so scaling is mostly a question of how many workers you run. For converting that same content into plain markdown rather than structured records, the URL to markdown guide covers the other side of the same coin.
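A minimal sketch of that loop with a thread pool, assuming an extract_one helper that wraps the POST request shown earlier and the validation helpers from the previous section:

from concurrent.futures import ThreadPoolExecutor, as_completed

def process(url: str):
    body = extract_one(url)             # hypothetical wrapper around the API call above
    if not body.get("success"):
        return None
    record = body["data"]["aiAnalysis"]
    if not check_completeness(record):  # validation before persistence
        return None
    record["price"] = coerce_price(record.get("price"))
    record["source_url"] = body["data"]["finalUrl"]
    return record

urls = ["https://example.com/product/sku-123"]  # your own URL list: sitemaps, feeds, search results
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = {pool.submit(process, u): u for u in urls}
    for future in as_completed(futures):
        record = future.result()
        if record:
            ...  # persist to your store

Throughput is then just a question of the worker count and your plan's rate limits.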
Where to go from here
If your job is structured extraction (lead enrichment, competitor monitoring, classification, content cataloguing), the schema-driven path is almost always the right starting point. You write a schema once, you reuse it across thousands of pages, and the maintenance burden stays close to zero.
If you want a deeper walkthrough of writing schemas for different page types (articles, products, contact pages, job listings), the GPT-5 extraction tutorial is the next read. For the broader strategic picture, the main AI scraping guide covers when this approach earns its keep and when it does not.
To see the full API contract, including every field on the response envelope and language examples in cURL, JavaScript, Python, and PHP, head to the documentation.
Try CrawlAI
Turn any URL into structured JSON with your own schema, powered by GPT-5. Pay-as-you-go starts at $10.