Extract Data with GPT-5: A Practical Tutorial
The premise of schema-driven extraction is simple. Describe the data you want as JSON. Hand the description and a URL to an API. Get the structured data back. No selectors, no parsing, no per-site code.
This tutorial is hands-on. We will walk through three realistic schemas end to end: a product listing, an article, and a contact-info block. For each you will see the request, the response, and the schema design choices that matter. The high-level shift from selectors to schemas is covered in the main guide. This post focuses on the practical mechanics of doing the work.
The shape of every request
Before the examples, here is the API contract you will be using. One endpoint, three fields in the body.
POST https://crawlai.io/api/scrape/{token}
Content-Type: application/json
{
"url": "the page to extract from",
"selector": "optional CSS selector to narrow the region, defaults to body",
"jsonSchema": "the JSON schema describing the fields you want"
}
The response contains page metadata and an aiAnalysis object shaped exactly like your schema.
{
"success": true,
"data": {
"title": "page title from the document",
"finalUrl": "post-redirect URL",
"statusCode": 200,
"metaDescription": "meta description tag content",
"content": "cleaned page text",
"aiAnalysis": { "your": "fields", "go": "here" }
},
"remaining_calls": 999
}
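If you would rather call the endpoint from code than from curl, here is a minimal Python sketch of the same contract. The endpoint and body fields come straight from the contract above; reading the token from a CRAWLAI_TOKEN environment variable is an assumption for illustration, not an API requirement.

import os

import requests

# Build the endpoint from the token, mirroring the contract above.
TOKEN = os.environ["CRAWLAI_TOKEN"]  # assumed env var, set it however you like
ENDPOINT = f"https://crawlai.io/api/scrape/{TOKEN}"

def scrape(url: str, json_schema: dict, selector: str = "body") -> dict:
    """POST one extraction job and return the parsed response body."""
    resp = requests.post(
        ENDPOINT,
        json={"url": url, "selector": selector, "jsonSchema": json_schema},
        timeout=120,  # AI extraction can take a while on heavy pages
    )
    resp.raise_for_status()
    return resp.json()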
That is the whole API. Now the examples.
Example one: a product listing
Imagine you are tracking competitor pricing on a list of product pages. You care about the product name, the price, the currency, and whether the item is in stock. The schema is small but the description on each field does a lot of work.
curl -X POST https://crawlai.io/api/scrape/$CRAWLAI_TOKEN \
-H "Content-Type: application/json" \
-d '{
"url": "https://example-shop.com/products/blue-widget",
"selector": "main",
"jsonSchema": {
"type": "object",
"properties": {
"title": { "type": "string", "description": "Full product name as shown on the page, excluding model numbers and color suffixes when separable" },
"price": { "type": "number", "description": "Numeric price as a decimal, in the units shown on the page. Do not include currency symbols." },
"currency": { "type": "string", "description": "ISO 4217 currency code such as USD, EUR, GBP. Infer from the symbol if the code is not present." },
"inStock": { "type": "boolean", "description": "True if the page clearly indicates the item is available to purchase right now, false if backordered or sold out, null if unclear" }
},
"required": ["title", "price", "currency"]
}
}'
A realistic response.
{
"success": true,
"data": {
"title": "Example Shop | Blue Widget",
"finalUrl": "https://example-shop.com/products/blue-widget",
"statusCode": 200,
"metaDescription": "The Blue Widget by Example Shop, made of recycled aluminium.",
"content": "Blue Widget\nA durable everyday widget...",
"aiAnalysis": {
"title": "Blue Widget",
"price": 24.99,
"currency": "EUR",
"inStock": true
}
},
"remaining_calls": 998
}
Notice the difference between data.title (whatever the page's <title> tag says) and aiAnalysis.title (the actual product name as understood by the model). Both are useful for different reasons. The page title is good for deduplication and logging. The product name is what you store in your database.
A few design choices worth calling out. The price field is a number, not a string. The description forbids currency symbols, which matters because the model will sometimes try to be helpful and include them. The currency field tells the model to infer the code from a symbol when needed, which handles pages that show "€24.99" without spelling out EUR. The inStock field is typed ["boolean", "null"], so it has three honest states (true, false, null), which is much more useful than a binary that lies when the page is ambiguous.
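To see how those choices pay off when you store the record, here is a small Python sketch that normalizes the aiAnalysis block, assuming the scrape helper from earlier. The Decimal conversion and the output keys are design choices for illustration, not anything the API mandates.

from decimal import Decimal

def normalize_product(ai: dict) -> dict:
    """Coerce the product aiAnalysis block into a storable record."""
    price = Decimal(str(ai["price"]))  # exact decimal arithmetic, not float
    if price <= 0:
        raise ValueError(f"implausible price: {price}")
    currency = ai["currency"].upper()
    if len(currency) != 3:
        raise ValueError(f"not an ISO 4217 code: {currency}")
    return {
        "name": ai["title"],
        "price": price,
        "currency": currency,
        "in_stock": ai.get("inStock"),  # True, False, or None (ambiguous)
    }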
For more on getting structured product data out of arbitrary pages, the HTML to JSON walkthrough covers the same idea with a different example.
Example two: an article
Now suppose you are building a content database. You want the title, the author, the publish date, and a rough word count for any article URL.
curl -X POST https://crawlai.io/api/scrape/$CRAWLAI_TOKEN \
-H "Content-Type: application/json" \
-d '{
"url": "https://example-news.com/2026/05/widgets-rising",
"selector": "article",
"jsonSchema": {
"type": "object",
"properties": {
"title": { "type": "string", "description": "Article headline as it appears at the top of the article body, not the SEO title from the head" },
"author": { "type": "string", "description": "Byline author name. If multiple authors are listed, join them with commas. Empty string if no byline is present." },
"publishDate": { "type": "string", "description": "Publication date in ISO 8601 format (YYYY-MM-DD). If only a relative date like '2 days ago' is shown, leave empty." },
"wordCount": { "type": "integer", "description": "Approximate number of words in the main article body, excluding navigation, ads, comments, and related-article strips." }
},
"required": ["title"]
}
}'
A realistic response.
{
"success": true,
"data": {
"title": "Widgets Rising | Example News",
"finalUrl": "https://example-news.com/2026/05/widgets-rising",
"statusCode": 200,
"metaDescription": "Why the widget market grew 30% this quarter.",
"content": "Widgets Rising\nBy Jane Doe...",
"aiAnalysis": {
"title": "Widgets Rising",
"author": "Jane Doe",
"publishDate": "2026-05-08",
"wordCount": 842
}
},
"remaining_calls": 997
}
A few things matter here. The selector is narrowed to article, which keeps related-article strips and comment sections out of the prompt. The publishDate description forces ISO 8601 and tells the model to leave it empty rather than guess, which prevents fake dates from showing up in your database. The wordCount is an integer with an explicit definition of what is and is not counted. Without that, the model would sometimes count navigation links or the comments section.
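The empty-rather-than-guess rule also keeps the downstream date handling trivial. A Python sketch, assuming the empty-string convention from the schema above; the 1990 lower bound is an arbitrary sanity check.

from datetime import date

def parse_publish_date(value: str) -> date | None:
    """Empty string means the page only showed a relative date."""
    if not value:
        return None
    parsed = date.fromisoformat(value)  # raises on anything not ISO 8601
    if not (date(1990, 1, 1) <= parsed <= date.today()):
        raise ValueError(f"publish date out of range: {parsed}")
    return parsed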
If your downstream use is RAG, the URL to LLM context post covers turning whole pages into clean context rather than into structured records.
Example three: contact information
Now the most common lead-enrichment job: pull a company's contact details off its site. Email, phone, and street address.
curl -X POST https://crawlai.io/api/scrape/$CRAWLAI_TOKEN \
-H "Content-Type: application/json" \
-d '{
"url": "https://acme.example.com/contact",
"selector": "body",
"jsonSchema": {
"type": "object",
"properties": {
"email": { "type": "string", "description": "Primary contact email. Prefer addresses like info@, hello@, contact@ over personal addresses. Empty string if none is shown." },
"phone": { "type": "string", "description": "Primary contact phone number in E.164 format if possible (e.g. +31201234567). Otherwise the format shown on the page." },
"address": {
"type": "object",
"description": "Postal address of the company headquarters or main office.",
"properties": {
"street": { "type": "string", "description": "Street name and number" },
"city": { "type": "string", "description": "City name" },
"postalCode": { "type": "string", "description": "Postal or ZIP code in the local format" },
"country": { "type": "string", "description": "Country name in English" }
}
}
}
}
}'
A realistic response.
{
"success": true,
"data": {
"title": "Contact Acme",
"finalUrl": "https://acme.example.com/contact",
"statusCode": 200,
"metaDescription": "Get in touch with the Acme team.",
"content": "Contact us at hello@acme.example.com...",
"aiAnalysis": {
"email": "hello@acme.example.com",
"phone": "+31201234567",
"address": {
"street": "Keizersgracht 123",
"city": "Amsterdam",
"postalCode": "1015 CJ",
"country": "Netherlands"
}
}
},
"remaining_calls": 996
}
This example shows a nested object. CrawlAI handles it fine, but as a rule, flatter schemas behave better. If you can express the same data without nesting, do. The address is a fair case for nesting because the fields belong together and are often missing as a group.
The email description is deliberately opinionated. By telling the model to prefer generic role addresses over personal ones, you avoid scraping individual employees from a press page or an author bio. That kind of nudge is the closest you get to prompt engineering with this API.
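If your lead store is a flat table, the nested address is easy to unpack on the way in. A sketch in Python; the output column names are an assumption for illustration.

def flatten_contact(ai: dict) -> dict:
    """Unpack the nested contact record into flat columns."""
    address = ai.get("address") or {}
    return {
        "email": ai.get("email") or None,  # normalize empty string to None
        "phone": ai.get("phone") or None,
        "street": address.get("street"),
        "city": address.get("city"),
        "postal_code": address.get("postalCode"),
        "country": address.get("country"),
    }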
Schema design tips that compound
Across hundreds of extraction jobs, the same handful of patterns separate a flaky schema from a reliable one.
- Descriptions matter more than names. A field called price is fine. A field called price with a description that bans currency symbols and pins the unit is much better. Write descriptions as if you are briefing a careful contractor.
- Pick the right type. Numbers should be numbers, dates should be ISO 8601 strings, booleans should be booleans. If you let everything default to string, you push the parsing problem from the API into your code.
- Allow null or empty for missing data. Pages have holes. A schema that does not allow holes will pressure the model into inventing values. Better to say "leave empty if not present" in the description.
- Use required only for fields you genuinely need. Marking everything required is tempting and unhelpful. The model will still fill in the rest where it can.
- Narrow the selector when noise is obvious. If a page has a huge footer or a comments section, target article or main or a specific class. Less noise means a faster, cheaper, more accurate response.
- Validate defensively. Even with a strict schema, type coercion can drift. Run your own checks: is the price a positive number? Is the date in range? Does the email contain an @? Treat the response as untrusted until validated (see the sketch after this list).
- Cache by URL plus schema. The same input gives the same output, so a hash of (url, schema) is a perfect cache key. Skip the API call when you already have a recent result (also in the sketch below).
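The last two tips fit in a few lines of Python. A sketch under the assumptions above; the freshness window and storage backend are left to you.

import hashlib
import json

def cache_key(url: str, json_schema: dict) -> str:
    """Stable hash of (url, schema). Canonical JSON serialization so key
    order and whitespace never cause spurious cache misses."""
    canonical = json.dumps(json_schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{url}\n{canonical}".encode()).hexdigest()

def checked_analysis(response: dict) -> dict:
    """Treat the response as untrusted until the basics hold."""
    if not response.get("success"):
        raise RuntimeError("scrape reported failure")
    ai = response["data"]["aiAnalysis"]
    if not isinstance(ai, dict):
        raise TypeError("aiAnalysis is not an object")
    return ai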
A note on cost and limits
Each scrape consumes one credit. That credit covers the page fetch, the cleaning step, and the GPT-5 call. There is no separate "AI" charge, which makes pricing easy to model: number of URLs equals number of credits. The response includes remaining_calls so your worker can stop early when it sees the bucket emptying.
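The stop-early pattern is a one-line check in the worker loop. A sketch, assuming the scrape helper from the first example; the threshold of 10 credits is arbitrary.

MIN_CREDITS = 10  # arbitrary safety margin, tune to your batch size

def drain(urls: list[str], json_schema: dict) -> list[dict]:
    """Process URLs until done or the credit bucket runs low."""
    results = []
    for url in urls:
        response = scrape(url, json_schema)
        results.append(response["data"]["aiAnalysis"])
        if response["remaining_calls"] < MIN_CREDITS:
            break  # stop before a mid-batch hard failure
    return results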
If you are comparing this approach to running your own GPT-5 calls against raw HTML, the difference is not the model. It is the rendering, the anti-bot handling, the cleaning, and the predictable schema enforcement. The headless browser scraping post covers the rendering side of that in more detail.
Putting it together
The pattern across all three examples is the same.
- Decide what fields you need.
- Write a JSON schema with strong descriptions and the right types.
- Optionally narrow the selector to the meaningful region.
- POST the request, validate the response, store it.
The work that used to live in CSS selectors and brittle parsers now lives in the schema. The schema is more readable, more portable, and less likely to break when the site redesigns. That is the real value of using GPT-5 to extract data: the parsing logic moves out of your code and into something a human can read.
For a comparison with the more ad-hoc "ask ChatGPT to look at this page" approach, see the web scraping with ChatGPT post. For the full list of API fields, error codes, and language examples, the documentation has everything in one place.
Start with a small schema. Validate the response. Add fields as you need them. That is the whole tutorial.
Try CrawlAI
Turn any URL into structured JSON with your own schema, powered by GPT-5. Pay-as-you-go starts at $10.