Documentation
CrawlAI is scrape + AI in one call. You give it a URL and a JSON schema describing what you want, and it returns that exact structure, filled in by GPT-5 from the page contents. No selectors, no parsing, no per-site code.
Overview
One POST per page. The request takes three things:
- url — the page to crawl.
- selector — CSS selector for the region of the page to feed the AI (defaults to body).
- jsonSchema — a JSON Schema describing the fields you want extracted.
The response includes the usual page metadata (title, meta description, final URL, status code) plus an aiAnalysis object that matches your schema. One credit = one scrape + one AI extraction.
If you skip jsonSchema, you'll get the page metadata but aiAnalysis will be null.
Base URL:
https://crawlai.io
Authentication
Every request is authenticated by passing your API token as part of the URL path. There are no headers, cookies, or OAuth flows. Treat the token like a password and don't commit it to public repositories.
https://crawlai.io/api/scrape/{your_api_token}
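A simple way to keep the token out of source control is to read it from the environment when building the request URL. A minimal Python sketch (the helper name and the CRAWLAI_TOKEN variable name are our own conventions, not part of the API):

```python
import os

def scrape_endpoint(base="https://crawlai.io"):
    # Read the token from the environment so it never lands in a repo.
    token = os.environ["CRAWLAI_TOKEN"]
    return f"{base}/api/scrape/{token}"
```

Set CRAWLAI_TOKEN in your shell or CI secret store rather than hard-coding it.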
Quickstart
Extract a company's country, industry, and contact email from their homepage in a single call:
curl -X POST https://crawlai.io/api/scrape/YOUR_TOKEN \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://boei.help",
    "selector": "body",
    "jsonSchema": {
      "type": "object",
      "properties": {
        "country": { "type": "string", "description": "Country where the company is based" },
        "industry": { "type": "string", "description": "Industry of the company" },
        "contactInfo": {
          "type": "object",
          "properties": {
            "email": { "type": "string", "description": "Contact email" }
          }
        }
      }
    }
  }'
The aiAnalysis field in the response will mirror the shape of your schema with the extracted values filled in.
What you can build
The whole point of CrawlAI is that you don't write parsers. You describe the shape of the data you want and GPT-5 pulls it out of the page. Each recipe below is one API call with a different jsonSchema, shown in curl, JavaScript, Python, and PHP.
Lead enrichment
Take a list of company domains and turn each one into a structured row: industry, country, value proposition, contact email. Drop it straight into your CRM. No "if Shopify do X, if Webflow do Y" parsing.
curl
curl -X POST https://crawlai.io/api/scrape/$CRAWLAI_TOKEN \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://acme.com",
    "jsonSchema": {
      "type": "object",
      "properties": {
        "industry": { "type": "string", "description": "Industry of the company" },
        "country": { "type": "string", "description": "Country where the company is based" },
        "valueProp": { "type": "string", "description": "One-sentence summary of what they sell" },
        "email": { "type": "string", "description": "Contact email" }
      }
    }
  }'

JavaScript
const schema = {
  type: 'object',
  properties: {
    industry: { type: 'string', description: 'Industry of the company' },
    country: { type: 'string', description: 'Country where the company is based' },
    valueProp: { type: 'string', description: 'One-sentence summary of what they sell' },
    email: { type: 'string', description: 'Contact email' },
  },
};

const { data } = await fetch(`https://crawlai.io/api/scrape/${TOKEN}`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ url: `https://${domain}`, jsonSchema: schema }),
}).then(r => r.json());

await crm.update(leadId, data.aiAnalysis);

Python
schema = {
    "type": "object",
    "properties": {
        "industry": {"type": "string", "description": "Industry of the company"},
        "country": {"type": "string", "description": "Country where the company is based"},
        "valueProp": {"type": "string", "description": "One-sentence summary of what they sell"},
        "email": {"type": "string", "description": "Contact email"},
    },
}

r = requests.post(
    f"https://crawlai.io/api/scrape/{token}",
    json={"url": f"https://{domain}", "jsonSchema": schema},
).json()
crm.update(lead_id, r["data"]["aiAnalysis"])

PHP
$schema = [
    'type' => 'object',
    'properties' => [
        'industry' => ['type' => 'string', 'description' => 'Industry of the company'],
        'country' => ['type' => 'string', 'description' => 'Country where the company is based'],
        'valueProp' => ['type' => 'string', 'description' => 'One-sentence summary of what they sell'],
        'email' => ['type' => 'string', 'description' => 'Contact email'],
    ],
];

$res = Http::post("https://crawlai.io/api/scrape/{$token}", [
    'url' => "https://{$domain}",
    'jsonSchema' => $schema,
])->json();
$crm->update($leadId, $res['data']['aiAnalysis']);
Competitor monitoring
Scrape a competitor's pricing or product pages on a schedule and extract the bits that actually matter (plan names, prices, features). Diff the structured output between runs to alert on real changes, not cosmetic ones.
curl
curl -X POST https://crawlai.io/api/scrape/$CRAWLAI_TOKEN \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://competitor.com/pricing",
    "jsonSchema": {
      "type": "object",
      "properties": {
        "plans": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "name": { "type": "string" },
              "price": { "type": "string", "description": "Price as shown" },
              "features": { "type": "array", "items": { "type": "string" } }
            }
          }
        }
      }
    }
  }'

JavaScript
const schema = {
  type: 'object',
  properties: {
    plans: {
      type: 'array',
      items: {
        type: 'object',
        properties: {
          name: { type: 'string' },
          price: { type: 'string', description: 'Price as shown' },
          features: { type: 'array', items: { type: 'string' } },
        },
      },
    },
  },
};

const today = (await scrape({ url: 'https://competitor.com/pricing', jsonSchema: schema })).data.aiAnalysis;
const prev = await db.load('competitor_pricing');
if (JSON.stringify(today) !== JSON.stringify(prev)) {
  notifySlack('Competitor pricing changed', today);
  await db.save('competitor_pricing', today);
}

Python
schema = {
    "type": "object",
    "properties": {
        "plans": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "price": {"type": "string", "description": "Price as shown"},
                    "features": {"type": "array", "items": {"type": "string"}},
                },
            },
        },
    },
}

today = scrape("https://competitor.com/pricing", schema)["data"]["aiAnalysis"]
if today != db.load("competitor_pricing"):
    notify_slack("Competitor pricing changed", today)
    db.save("competitor_pricing", today)

PHP
$schema = [
    'type' => 'object',
    'properties' => [
        'plans' => [
            'type' => 'array',
            'items' => [
                'type' => 'object',
                'properties' => [
                    'name' => ['type' => 'string'],
                    'price' => ['type' => 'string', 'description' => 'Price as shown'],
                    'features' => ['type' => 'array', 'items' => ['type' => 'string']],
                ],
            ],
        ],
    ],
];

$today = scrape('https://competitor.com/pricing', $schema)['data']['aiAnalysis'];
if ($today != Cache::get('competitor_pricing')) {
    notifySlack('Competitor pricing changed', $today);
    Cache::put('competitor_pricing', $today);
}
Market research
Pass a list of company URLs through the same schema to build a comparable dataset (positioning, target segment, headline features) across an entire market in minutes instead of weeks.
curl
# Run once per competitor URL, then aggregate the results.
curl -X POST https://crawlai.io/api/scrape/$CRAWLAI_TOKEN \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://competitor.com",
    "jsonSchema": {
      "type": "object",
      "properties": {
        "positioning": { "type": "string", "description": "How the company positions itself" },
        "targetMarket": { "type": "string", "description": "Who their product is for" },
        "headlineFeatures": { "type": "array", "items": { "type": "string" } }
      }
    }
  }'

JavaScript
const schema = {
  type: 'object',
  properties: {
    positioning: { type: 'string', description: 'How the company positions itself' },
    targetMarket: { type: 'string', description: 'Who their product is for' },
    headlineFeatures: { type: 'array', items: { type: 'string' } },
  },
};

const rows = await Promise.all(
  competitorUrls.map(async url =>
    (await scrape({ url, jsonSchema: schema })).data.aiAnalysis
  )
);
writeCsv('market_landscape.csv', rows);

Python
schema = {
    "type": "object",
    "properties": {
        "positioning": {"type": "string", "description": "How the company positions itself"},
        "targetMarket": {"type": "string", "description": "Who their product is for"},
        "headlineFeatures": {"type": "array", "items": {"type": "string"}},
    },
}

rows = [scrape(u, schema)["data"]["aiAnalysis"] for u in competitor_urls]
write_csv("market_landscape.csv", rows)

PHP
$schema = [
    'type' => 'object',
    'properties' => [
        'positioning' => ['type' => 'string', 'description' => 'How the company positions itself'],
        'targetMarket' => ['type' => 'string', 'description' => 'Who their product is for'],
        'headlineFeatures' => ['type' => 'array', 'items' => ['type' => 'string']],
    ],
];

$rows = array_map(
    fn($u) => scrape($u, $schema)['data']['aiAnalysis'],
    $competitorUrls
);
writeCsv('market_landscape.csv', $rows);
Classification & tagging
Have a pile of URLs you don't know much about? Define a schema with an enum of categories and CrawlAI will classify each one with reasoning attached. Way more accurate than guessing from the domain.
curl
curl -X POST https://crawlai.io/api/scrape/$CRAWLAI_TOKEN \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "jsonSchema": {
      "type": "object",
      "properties": {
        "pageType": {
          "type": "string",
          "enum": ["homepage", "pricing", "blog post", "product page", "docs", "other"]
        },
        "businessCategory": {
          "type": "string",
          "enum": ["saas", "ecommerce", "agency", "media", "nonprofit", "other"]
        },
        "reasoning": { "type": "string", "description": "Brief justification" }
      }
    }
  }'

JavaScript
const schema = {
  type: 'object',
  properties: {
    pageType: {
      type: 'string',
      enum: ['homepage', 'pricing', 'blog post', 'product page', 'docs', 'other'],
    },
    businessCategory: {
      type: 'string',
      enum: ['saas', 'ecommerce', 'agency', 'media', 'nonprofit', 'other'],
    },
    reasoning: { type: 'string', description: 'Brief justification' },
  },
};

const { aiAnalysis } = (await scrape({ url, jsonSchema: schema })).data;
// => { pageType: 'pricing', businessCategory: 'saas', reasoning: '...' }

Python
schema = {
    "type": "object",
    "properties": {
        "pageType": {
            "type": "string",
            "enum": ["homepage", "pricing", "blog post", "product page", "docs", "other"],
        },
        "businessCategory": {
            "type": "string",
            "enum": ["saas", "ecommerce", "agency", "media", "nonprofit", "other"],
        },
        "reasoning": {"type": "string", "description": "Brief justification"},
    },
}

ai = scrape(url, schema)["data"]["aiAnalysis"]
# => {"pageType": "pricing", "businessCategory": "saas", "reasoning": "..."}

PHP
$schema = [
    'type' => 'object',
    'properties' => [
        'pageType' => [
            'type' => 'string',
            'enum' => ['homepage', 'pricing', 'blog post', 'product page', 'docs', 'other'],
        ],
        'businessCategory' => [
            'type' => 'string',
            'enum' => ['saas', 'ecommerce', 'agency', 'media', 'nonprofit', 'other'],
        ],
        'reasoning' => ['type' => 'string', 'description' => 'Brief justification'],
    ],
];

$ai = scrape($url, $schema)['data']['aiAnalysis'];
// => ['pageType' => 'pricing', 'businessCategory' => 'saas', 'reasoning' => '...']
Structured extraction from any page
Job listings, event pages, product detail pages, real estate listings — anywhere you'd normally write a custom scraper, just describe the fields and let the AI do it.
curl
curl -X POST https://crawlai.io/api/scrape/$CRAWLAI_TOKEN \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://jobs.example.com/listing/123",
    "jsonSchema": {
      "type": "object",
      "properties": {
        "title": { "type": "string" },
        "company": { "type": "string" },
        "location": { "type": "string" },
        "remote": { "type": "boolean", "description": "Whether the role is remote" },
        "salary": { "type": "string", "description": "Salary range as shown, or empty" },
        "skills": { "type": "array", "items": { "type": "string" } }
      }
    }
  }'

JavaScript
const schema = {
  type: 'object',
  properties: {
    title: { type: 'string' },
    company: { type: 'string' },
    location: { type: 'string' },
    remote: { type: 'boolean', description: 'Whether the role is remote' },
    salary: { type: 'string', description: 'Salary range as shown, or empty' },
    skills: { type: 'array', items: { type: 'string' } },
  },
};

const { aiAnalysis } = (await scrape({ url: jobUrl, jsonSchema: schema })).data;
await db.jobs.insert(aiAnalysis);

Python
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "company": {"type": "string"},
        "location": {"type": "string"},
        "remote": {"type": "boolean", "description": "Whether the role is remote"},
        "salary": {"type": "string", "description": "Salary range as shown, or empty"},
        "skills": {"type": "array", "items": {"type": "string"}},
    },
}

ai = scrape(job_url, schema)["data"]["aiAnalysis"]
db.jobs.insert(ai)

PHP
$schema = [
    'type' => 'object',
    'properties' => [
        'title' => ['type' => 'string'],
        'company' => ['type' => 'string'],
        'location' => ['type' => 'string'],
        'remote' => ['type' => 'boolean', 'description' => 'Whether the role is remote'],
        'salary' => ['type' => 'string', 'description' => 'Salary range as shown, or empty'],
        'skills' => ['type' => 'array', 'items' => ['type' => 'string']],
    ],
];

$ai = scrape($jobUrl, $schema)['data']['aiAnalysis'];
DB::table('jobs')->insert($ai);
Feeding RAG / LLM pipelines
If you only want clean page text for embeddings, skip jsonSchema and use the content field. The page is already rendered with JavaScript executed and stripped of boilerplate, so it drops straight into an embedding pipeline.
curl
# No jsonSchema -> response.data.content is the clean page text.
curl -X POST https://crawlai.io/api/scrape/$CRAWLAI_TOKEN \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

JavaScript
const { data } = await fetch(`https://crawlai.io/api/scrape/${TOKEN}`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ url }),
}).then(r => r.json());

const embeddings = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: chunk(data.content, 1000),
});
await vectorDb.upsert(embeddings, { source: data.finalUrl, title: data.title });

Python
data = requests.post(
    f"https://crawlai.io/api/scrape/{token}",
    json={"url": url},
).json()["data"]

embeddings = openai.embeddings.create(
    model="text-embedding-3-small",
    input=chunk(data["content"], 1000),
)
vector_db.upsert(embeddings, source=data["finalUrl"], title=data["title"])

PHP
$data = Http::post("https://crawlai.io/api/scrape/{$token}", ['url' => $url])
    ->json()['data'];

$embeddings = $openai->embeddings([
    'model' => 'text-embedding-3-small',
    'input' => chunkText($data['content'], 1000),
]);
$vectorDb->upsert($embeddings, ['source' => $data['finalUrl'], 'title' => $data['title']]);
The more specific your description fields are, the better the extraction. Treat them as instructions to GPT-5, not just labels.
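For example (an illustrative sketch; the price field and its wording are hypothetical, not part of the API), a description that states format, scope, and a fallback gives the model far more to work with than a bare label:

```python
# Bare label: the model has to guess the format, the currency,
# and what to do when the page shows no price at all.
vague = {"price": {"type": "string", "description": "Price"}}

# Instruction-style description: format, scope, and fallback are explicit.
instructive = {
    "price": {
        "type": "string",
        "description": (
            "Monthly price of the cheapest paid plan as shown on the page, "
            "including the currency symbol (e.g. '$29'). "
            "Empty string if no pricing is visible."
        ),
    }
}

schema = {"type": "object", "properties": instructive}
```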
POST /api/scrape/{token}
Scrape a single URL and optionally run GPT-5 extraction against a JSON schema you supply. Consumes one credit per successful call. Failed calls (4xx, 5xx) are automatically refunded.
Request body
| Field | Type | Required | Description |
|---|---|---|---|
| url | string | yes | The fully qualified URL to scrape (must include the scheme). |
| selector | string | no | CSS selector for the region to feed the AI. Defaults to body. Narrow it (e.g. main, article) to reduce noise on cluttered pages. |
| jsonSchema | object | no | JSON Schema describing the fields you want extracted. When present, the response's aiAnalysis will match this shape. Omit it if you only want the raw page text. |
Example
request
POST /api/scrape/9hmfPQYy...0hefEB HTTP/1.1
Host: crawlai.io
Content-Type: application/json

{
  "url": "https://boei.help",
  "selector": "body",
  "jsonSchema": {
    "type": "object",
    "properties": {
      "country": { "type": "string", "description": "Country where the company is based" },
      "industry": { "type": "string", "description": "Industry of the company" },
      "language": { "type": "string", "description": "Primary language of the website" },
      "contactInfo": {
        "type": "object",
        "properties": {
          "email": { "type": "string", "description": "Contact email" }
        }
      }
    }
  }
}
response 200
{
  "success": true,
  "data": {
    "url": "https://boei.help",
    "finalUrl": "https://boei.help/",
    "title": "Boei - Capture Leads and Engage Visitors with AI",
    "metaDescription": "Your website can be a lead-generating hero...",
    "statusCode": 200,
    "aiAnalysis": {
      "country": "Netherlands",
      "industry": "Lead Generation",
      "language": "English",
      "contactInfo": { "email": "support@boei.help" }
    },
    "timestamp": "2026-05-12T11:46:19.349Z"
  },
  "remaining_calls": 62
}
Omit jsonSchema if you only need the page text. The response will still include title, metaDescription, finalUrl, etc., and aiAnalysis will be null.
GET /api/info/{token}
Returns details about your account, including credit usage. Does not consume a credit.
response 200
{
  "customer": {
    "name": "Acme Inc",
    "email": "you@example.com",
    "call_limit": 1000,
    "limit_type": "monthly",
    "calls_used": 38,
    "remaining_calls": 962,
    "last_reset_at": "2026-05-01T00:00:00.000000Z"
  }
}
GET /api/health
Public health check. Useful for status pages and uptime monitors.
response 200
{ "status": "ok", "timestamp": "2026-05-12T11:46:02.928157Z" }
Response format
Successful scrape responses always return success: true with the following fields inside data:
| Field | Type | Description |
|---|---|---|
| url | string | The URL you submitted. |
| finalUrl | string | The URL after following redirects. |
| title | string | Page <title>. |
| metaDescription | string | Content of the meta description tag, if any. |
| statusCode | integer | HTTP status code returned by the target site. |
| content | string | Cleaned, plain-text content of the page. |
| aiAnalysis | object or null | Structured data extracted by GPT-5, matching the shape of the jsonSchema you sent. null if no schema was provided. |
| timestamp | string | ISO 8601 timestamp of when the scrape completed. |
The top-level response also includes remaining_calls, an integer with your remaining credits after this request.
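A small Python sketch of consuming this shape defensively (the extract helper is our own; it just guards the two cases the table implies: a failed call, and a scrape that ran without a jsonSchema):

```python
def extract(body):
    """Return aiAnalysis from a parsed response body, or None."""
    if not body.get("success"):
        return None  # error responses carry no usable data
    # aiAnalysis is null when the request had no jsonSchema.
    return body["data"].get("aiAnalysis")

body = {
    "success": True,
    "data": {"title": "Example", "aiAnalysis": {"industry": "SaaS"}},
    "remaining_calls": 41,
}
print(extract(body))  # {'industry': 'SaaS'}
```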
Error codes
| Status | Meaning | What to do |
|---|---|---|
| 401 | Invalid API token | Check the token in your URL. Tokens are case-sensitive. |
| 429 | Limit exceeded | You're out of credits. Upgrade your plan or wait for your monthly reset. |
| 502 | Backend error | The target site or an upstream service returned an error. The call is refunded. |
| 503 | Service unavailable | Transient gateway error. Retry with exponential backoff. The call is refunded. |
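Because 502/503 are refunded and transient while 401/429 will not succeed on retry, a client can encode this table directly. A sketch in Python (do_request is a placeholder for whatever HTTP call you make; it is assumed to return a status code and a parsed body):

```python
import time

def call_with_retries(do_request, max_attempts=4, base_delay=1.0):
    """Retry only the statuses the API refunds (502/503)."""
    for attempt in range(max_attempts):
        status, body = do_request()
        if status == 200:
            return body
        if status in (502, 503):
            # Refunded and transient: back off exponentially, then retry.
            time.sleep(base_delay * 2 ** attempt)
            continue
        # 401/429 (and anything else) won't get better by retrying.
        raise RuntimeError(f"non-retryable status {status}")
    raise RuntimeError("still failing after retries")
```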
Rate limits
There is no per-second rate limit. The only limit is your credit balance.
- One-time plans grant a fixed bucket of credits that never resets.
- Monthly plans reset to your full call limit once per calendar month.
Each successful scrape consumes one credit. Failed scrapes (502/503) are automatically refunded.
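Since GET /api/info costs nothing, you can check the balance before starting a large batch. A sketch (the injectable fetch parameter is our own design choice so the helper can be tested without a network call; by default it does a plain stdlib HTTP GET):

```python
import json
from urllib.request import urlopen

def credits_remaining(token, fetch=None):
    """Look up remaining credits via /api/info; consumes no credits."""
    if fetch is None:
        fetch = lambda u: json.load(urlopen(u, timeout=30))
    info = fetch(f"https://crawlai.io/api/info/{token}")
    return info["customer"]["remaining_calls"]

# e.g. before a batch run:
# if credits_remaining(token) < len(urls):
#     raise SystemExit("not enough credits for this batch")
```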
Examples
A minimal "hello world" call in each language. Same request, same response.
curl
curl -X POST https://crawlai.io/api/scrape/$CRAWLAI_TOKEN \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://boei.help",
    "jsonSchema": {
      "type": "object",
      "properties": {
        "industry": { "type": "string", "description": "Industry of the company" },
        "country": { "type": "string", "description": "Country where the company is based" }
      }
    }
  }'

JavaScript
const schema = {
  type: 'object',
  properties: {
    industry: { type: 'string', description: 'Industry of the company' },
    country: { type: 'string', description: 'Country where the company is based' },
  },
};

const res = await fetch(`https://crawlai.io/api/scrape/${process.env.CRAWLAI_TOKEN}`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ url: 'https://boei.help', jsonSchema: schema }),
});
const { data, remaining_calls } = await res.json();
console.log(data.aiAnalysis, '|', remaining_calls, 'calls left');

Python
import os, requests

token = os.environ["CRAWLAI_TOKEN"]
schema = {
    "type": "object",
    "properties": {
        "industry": {"type": "string", "description": "Industry of the company"},
        "country": {"type": "string", "description": "Country where the company is based"},
    },
}

r = requests.post(
    f"https://crawlai.io/api/scrape/{token}",
    json={"url": "https://boei.help", "jsonSchema": schema},
    timeout=60,
)
r.raise_for_status()
body = r.json()
print(body["data"]["aiAnalysis"], "|", body["remaining_calls"], "left")

PHP
$token = getenv('CRAWLAI_TOKEN');
$schema = [
    'type' => 'object',
    'properties' => [
        'industry' => ['type' => 'string', 'description' => 'Industry of the company'],
        'country' => ['type' => 'string', 'description' => 'Country where the company is based'],
    ],
];

$res = Http::timeout(60)->post("https://crawlai.io/api/scrape/{$token}", [
    'url' => 'https://boei.help',
    'jsonSchema' => $schema,
])->json();
print_r($res['data']['aiAnalysis']);
echo $res['remaining_calls'].' left';
Checking remaining credits
curl https://crawlai.io/api/info/YOUR_TOKEN