Documentation

CrawlAI is scrape + AI in one call. You give it a URL and a JSON schema describing what you want, and it returns that exact structure, filled in by GPT-5 from the page contents. No selectors, no parsing, no per-site code.

Overview

One POST per page. The request takes three things:

  • url — the page to crawl.
  • selector — CSS selector for the region of the page to feed the AI (defaults to body).
  • jsonSchema — a JSON Schema describing the fields you want extracted.

The response includes the usual page metadata (title, meta description, final URL, status code) plus an aiAnalysis object that matches your schema. One credit = one scrape + one AI extraction.

If you skip jsonSchema, you'll get the page metadata but aiAnalysis will be null.

Base URL:

https://crawlai.io

Authentication

Every request is authenticated by passing your API token as part of the URL path. There are no headers, cookies, or OAuth flows. Treat the token like a password and don't commit it to public repositories.

https://crawlai.io/api/scrape/{your_api_token}

Don't have a token yet? Get one on the pricing page, or reach out if you need a custom plan.

Quickstart

Extract a company's country, industry, and contact email from their homepage in a single call:

curl
curl -X POST https://crawlai.io/api/scrape/YOUR_TOKEN \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://boei.help",
    "selector": "body",
    "jsonSchema": {
      "type": "object",
      "properties": {
        "country":  { "type": "string", "description": "Country where the company is based" },
        "industry": { "type": "string", "description": "Industry of the company" },
        "contactInfo": {
          "type": "object",
          "properties": {
            "email": { "type": "string", "description": "Contact email" }
          }
        }
      }
    }
  }'

The aiAnalysis field in the response will mirror the shape of your schema with the extracted values filled in.

What you can build

The whole point of CrawlAI is that you don't write parsers. You describe the shape of the data you want and GPT-5 pulls it out of the page. Each recipe below is one API call with a different jsonSchema, shown in curl, JavaScript, Python, and PHP.

Lead enrichment

Take a list of company domains and turn each one into a structured row: industry, country, value proposition, contact email. Drop it straight into your CRM. No "if Shopify do X, if Webflow do Y" parsing.

curl
curl -X POST https://crawlai.io/api/scrape/$CRAWLAI_TOKEN \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://acme.com",
    "jsonSchema": {
      "type": "object",
      "properties": {
        "industry":  { "type": "string", "description": "Industry of the company" },
        "country":   { "type": "string", "description": "Country where the company is based" },
        "valueProp": { "type": "string", "description": "One-sentence summary of what they sell" },
        "email":     { "type": "string", "description": "Contact email" }
      }
    }
  }'

javascript
const schema = {
  type: 'object',
  properties: {
    industry:  { type: 'string', description: 'Industry of the company' },
    country:   { type: 'string', description: 'Country where the company is based' },
    valueProp: { type: 'string', description: 'One-sentence summary of what they sell' },
    email:     { type: 'string', description: 'Contact email' },
  },
};

const { data } = await fetch(`https://crawlai.io/api/scrape/${TOKEN}`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ url: `https://${domain}`, jsonSchema: schema }),
}).then(r => r.json());

await crm.update(leadId, data.aiAnalysis);

python
import requests

schema = {
    "type": "object",
    "properties": {
        "industry":  {"type": "string", "description": "Industry of the company"},
        "country":   {"type": "string", "description": "Country where the company is based"},
        "valueProp": {"type": "string", "description": "One-sentence summary of what they sell"},
        "email":     {"type": "string", "description": "Contact email"},
    },
}

r = requests.post(
    f"https://crawlai.io/api/scrape/{token}",
    json={"url": f"https://{domain}", "jsonSchema": schema},
).json()

crm.update(lead_id, r["data"]["aiAnalysis"])

php
$schema = [
    'type' => 'object',
    'properties' => [
        'industry'  => ['type' => 'string', 'description' => 'Industry of the company'],
        'country'   => ['type' => 'string', 'description' => 'Country where the company is based'],
        'valueProp' => ['type' => 'string', 'description' => 'One-sentence summary of what they sell'],
        'email'     => ['type' => 'string', 'description' => 'Contact email'],
    ],
];

$res = Http::post("https://crawlai.io/api/scrape/{$token}", [
    'url'        => "https://{$domain}",
    'jsonSchema' => $schema,
])->json();

$crm->update($leadId, $res['data']['aiAnalysis']);

Competitor monitoring

Scrape a competitor's pricing or product pages on a schedule and extract the bits that actually matter (plan names, prices, features). Diff the structured output between runs to alert on real changes, not cosmetic ones.

curl
curl -X POST https://crawlai.io/api/scrape/$CRAWLAI_TOKEN \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://competitor.com/pricing",
    "jsonSchema": {
      "type": "object",
      "properties": {
        "plans": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "name":     { "type": "string" },
              "price":    { "type": "string", "description": "Price as shown" },
              "features": { "type": "array", "items": { "type": "string" } }
            }
          }
        }
      }
    }
  }'

javascript
const schema = {
  type: 'object',
  properties: {
    plans: {
      type: 'array',
      items: {
        type: 'object',
        properties: {
          name:     { type: 'string' },
          price:    { type: 'string', description: 'Price as shown' },
          features: { type: 'array', items: { type: 'string' } },
        },
      },
    },
  },
};

const today = (await scrape({ url: 'https://competitor.com/pricing', jsonSchema: schema })).data.aiAnalysis;
const prev  = await db.load('competitor_pricing');
if (JSON.stringify(today) !== JSON.stringify(prev)) {
  notifySlack('Competitor pricing changed', today);
  await db.save('competitor_pricing', today);
}

python
schema = {
    "type": "object",
    "properties": {
        "plans": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name":     {"type": "string"},
                    "price":    {"type": "string", "description": "Price as shown"},
                    "features": {"type": "array", "items": {"type": "string"}},
                },
            },
        },
    },
}

today = scrape("https://competitor.com/pricing", schema)["data"]["aiAnalysis"]
if today != db.load("competitor_pricing"):
    notify_slack("Competitor pricing changed", today)
    db.save("competitor_pricing", today)

php
$schema = [
    'type' => 'object',
    'properties' => [
        'plans' => [
            'type'  => 'array',
            'items' => [
                'type' => 'object',
                'properties' => [
                    'name'     => ['type' => 'string'],
                    'price'    => ['type' => 'string', 'description' => 'Price as shown'],
                    'features' => ['type' => 'array', 'items' => ['type' => 'string']],
                ],
            ],
        ],
    ],
];

$today = scrape('https://competitor.com/pricing', $schema)['data']['aiAnalysis'];
if ($today != Cache::get('competitor_pricing')) {
    notifySlack('Competitor pricing changed', $today);
    Cache::put('competitor_pricing', $today);
}

Market research

Pass a list of company URLs through the same schema to build a comparable dataset (positioning, target segment, headline features) across an entire market in minutes instead of weeks.

curl
# Run once per competitor URL, then aggregate the results.
curl -X POST https://crawlai.io/api/scrape/$CRAWLAI_TOKEN \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://competitor.com",
    "jsonSchema": {
      "type": "object",
      "properties": {
        "positioning":      { "type": "string", "description": "How the company positions itself" },
        "targetMarket":     { "type": "string", "description": "Who their product is for" },
        "headlineFeatures": { "type": "array", "items": { "type": "string" } }
      }
    }
  }'

javascript
const schema = {
  type: 'object',
  properties: {
    positioning:      { type: 'string', description: 'How the company positions itself' },
    targetMarket:     { type: 'string', description: 'Who their product is for' },
    headlineFeatures: { type: 'array', items: { type: 'string' } },
  },
};

const rows = await Promise.all(
  competitorUrls.map(async url =>
    (await scrape({ url, jsonSchema: schema })).data.aiAnalysis
  )
);
writeCsv('market_landscape.csv', rows);

python
schema = {
    "type": "object",
    "properties": {
        "positioning":      {"type": "string", "description": "How the company positions itself"},
        "targetMarket":     {"type": "string", "description": "Who their product is for"},
        "headlineFeatures": {"type": "array", "items": {"type": "string"}},
    },
}

rows = [scrape(u, schema)["data"]["aiAnalysis"] for u in competitor_urls]
write_csv("market_landscape.csv", rows)

php
$schema = [
    'type' => 'object',
    'properties' => [
        'positioning'      => ['type' => 'string', 'description' => 'How the company positions itself'],
        'targetMarket'     => ['type' => 'string', 'description' => 'Who their product is for'],
        'headlineFeatures' => ['type' => 'array', 'items' => ['type' => 'string']],
    ],
];

$rows = array_map(
    fn($u) => scrape($u, $schema)['data']['aiAnalysis'],
    $competitorUrls
);
writeCsv('market_landscape.csv', $rows);

Classification & tagging

Have a pile of URLs you don't know much about? Define a schema with an enum of categories and CrawlAI will classify each one with reasoning attached. Way more accurate than guessing from the domain.

curl
curl -X POST https://crawlai.io/api/scrape/$CRAWLAI_TOKEN \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "jsonSchema": {
      "type": "object",
      "properties": {
        "pageType": {
          "type": "string",
          "enum": ["homepage", "pricing", "blog post", "product page", "docs", "other"]
        },
        "businessCategory": {
          "type": "string",
          "enum": ["saas", "ecommerce", "agency", "media", "nonprofit", "other"]
        },
        "reasoning": { "type": "string", "description": "Brief justification" }
      }
    }
  }'

javascript
const schema = {
  type: 'object',
  properties: {
    pageType: {
      type: 'string',
      enum: ['homepage', 'pricing', 'blog post', 'product page', 'docs', 'other'],
    },
    businessCategory: {
      type: 'string',
      enum: ['saas', 'ecommerce', 'agency', 'media', 'nonprofit', 'other'],
    },
    reasoning: { type: 'string', description: 'Brief justification' },
  },
};

const { aiAnalysis } = (await scrape({ url, jsonSchema: schema })).data;
// => { pageType: 'pricing', businessCategory: 'saas', reasoning: '...' }

python
schema = {
    "type": "object",
    "properties": {
        "pageType": {
            "type": "string",
            "enum": ["homepage", "pricing", "blog post", "product page", "docs", "other"],
        },
        "businessCategory": {
            "type": "string",
            "enum": ["saas", "ecommerce", "agency", "media", "nonprofit", "other"],
        },
        "reasoning": {"type": "string", "description": "Brief justification"},
    },
}

ai = scrape(url, schema)["data"]["aiAnalysis"]
# => {"pageType": "pricing", "businessCategory": "saas", "reasoning": "..."}

php
$schema = [
    'type' => 'object',
    'properties' => [
        'pageType' => [
            'type' => 'string',
            'enum' => ['homepage', 'pricing', 'blog post', 'product page', 'docs', 'other'],
        ],
        'businessCategory' => [
            'type' => 'string',
            'enum' => ['saas', 'ecommerce', 'agency', 'media', 'nonprofit', 'other'],
        ],
        'reasoning' => ['type' => 'string', 'description' => 'Brief justification'],
    ],
];

$ai = scrape($url, $schema)['data']['aiAnalysis'];
// => ['pageType' => 'pricing', 'businessCategory' => 'saas', 'reasoning' => '...']

Structured extraction from any page

Job listings, event pages, product detail pages, real estate listings — anywhere you'd normally write a custom scraper, just describe the fields and let the AI do it.

curl
curl -X POST https://crawlai.io/api/scrape/$CRAWLAI_TOKEN \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://jobs.example.com/listing/123",
    "jsonSchema": {
      "type": "object",
      "properties": {
        "title":    { "type": "string" },
        "company":  { "type": "string" },
        "location": { "type": "string" },
        "remote":   { "type": "boolean", "description": "Whether the role is remote" },
        "salary":   { "type": "string", "description": "Salary range as shown, or empty" },
        "skills":   { "type": "array", "items": { "type": "string" } }
      }
    }
  }'

javascript
const schema = {
  type: 'object',
  properties: {
    title:    { type: 'string' },
    company:  { type: 'string' },
    location: { type: 'string' },
    remote:   { type: 'boolean', description: 'Whether the role is remote' },
    salary:   { type: 'string', description: 'Salary range as shown, or empty' },
    skills:   { type: 'array', items: { type: 'string' } },
  },
};

const { aiAnalysis } = (await scrape({ url: jobUrl, jsonSchema: schema })).data;
await db.jobs.insert(aiAnalysis);

python
schema = {
    "type": "object",
    "properties": {
        "title":    {"type": "string"},
        "company":  {"type": "string"},
        "location": {"type": "string"},
        "remote":   {"type": "boolean", "description": "Whether the role is remote"},
        "salary":   {"type": "string", "description": "Salary range as shown, or empty"},
        "skills":   {"type": "array", "items": {"type": "string"}},
    },
}

ai = scrape(job_url, schema)["data"]["aiAnalysis"]
db.jobs.insert(ai)

php
$schema = [
    'type' => 'object',
    'properties' => [
        'title'    => ['type' => 'string'],
        'company'  => ['type' => 'string'],
        'location' => ['type' => 'string'],
        'remote'   => ['type' => 'boolean', 'description' => 'Whether the role is remote'],
        'salary'   => ['type' => 'string', 'description' => 'Salary range as shown, or empty'],
        'skills'   => ['type' => 'array', 'items' => ['type' => 'string']],
    ],
];

$ai = scrape($jobUrl, $schema)['data']['aiAnalysis'];
DB::table('jobs')->insert($ai);

Feeding RAG / LLM pipelines

If you only want clean page text for embeddings, skip jsonSchema and use the content field. The page is already de-boilerplated and JS-rendered, so it drops straight into an embedding pipeline.

curl
# No jsonSchema -> response.data.content is the clean page text.
curl -X POST https://crawlai.io/api/scrape/$CRAWLAI_TOKEN \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

javascript
const { data } = await fetch(`https://crawlai.io/api/scrape/${TOKEN}`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ url }),
}).then(r => r.json());

const embeddings = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: chunk(data.content, 1000),
});
await vectorDb.upsert(embeddings, { source: data.finalUrl, title: data.title });

python
import requests

data = requests.post(
    f"https://crawlai.io/api/scrape/{token}",
    json={"url": url},
).json()["data"]

embeddings = openai.embeddings.create(
    model="text-embedding-3-small",
    input=chunk(data["content"], 1000),
)
vector_db.upsert(embeddings, source=data["finalUrl"], title=data["title"])

php
$data = Http::post("https://crawlai.io/api/scrape/{$token}", ['url' => $url])
    ->json()['data'];

$embeddings = $openai->embeddings([
    'model' => 'text-embedding-3-small',
    'input' => chunkText($data['content'], 1000),
]);
$vectorDb->upsert($embeddings, ['source' => $data['finalUrl'], 'title' => $data['title']]);

Tip: the more descriptive your description fields are, the better the extraction. Treat them as instructions to GPT-5, not just labels.
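As an illustration, here are two hypothetical schemas for the same field (neither is from the API reference; the field name and wording are ours). The second bakes extraction instructions into the description, which gives GPT-5 much more to go on:

```python
# Hypothetical schemas for the same field. The terse version leaves GPT-5
# guessing; the descriptive version tells it exactly what to extract and
# what to return when the value is absent.
terse = {
    "type": "object",
    "properties": {"price": {"type": "string"}},
}

descriptive = {
    "type": "object",
    "properties": {
        "price": {
            "type": "string",
            "description": (
                "Monthly price of the cheapest paid plan, exactly as shown "
                "(e.g. '$29/mo'); empty string if no pricing is listed"
            ),
        }
    },
}
```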

POST /api/scrape/{token}

Scrape a single URL and optionally run GPT-5 extraction against a JSON schema you supply. Consumes one credit per successful call. Failed calls (4xx, 5xx) are automatically refunded.

Request body

  • url (string, required) — The fully qualified URL to scrape (must include scheme).
  • selector (string, optional) — CSS selector for the region to feed the AI. Defaults to body. Narrow it (e.g. main, article) to reduce noise on cluttered pages.
  • jsonSchema (object, optional) — JSON Schema describing the fields you want extracted. When present, the response's aiAnalysis will match this shape. Omit it if you only want the raw page text.

Example

request
POST /api/scrape/9hmfPQYy...0hefEB HTTP/1.1
Host: crawlai.io
Content-Type: application/json

{
  "url": "https://boei.help",
  "selector": "body",
  "jsonSchema": {
    "type": "object",
    "properties": {
      "country":  { "type": "string", "description": "Country where the company is based" },
      "industry": { "type": "string", "description": "Industry of the company" },
      "language": { "type": "string", "description": "Primary language of the website" },
      "contactInfo": {
        "type": "object",
        "properties": {
          "email": { "type": "string", "description": "Contact email" }
        }
      }
    }
  }
}
response 200
{
  "success": true,
  "data": {
    "url": "https://boei.help",
    "finalUrl": "https://boei.help/",
    "title": "Boei - Capture Leads and Engage Visitors with AI",
    "metaDescription": "Your website can be a lead-generating hero...",
    "statusCode": 200,
    "aiAnalysis": {
      "country": "Netherlands",
      "industry": "Lead Generation",
      "language": "English",
      "contactInfo": { "email": "support@boei.help" }
    },
    "timestamp": "2026-05-12T11:46:19.349Z"
  },
  "remaining_calls": 62
}
Skip jsonSchema if you only need the page text. The response will still include title, metaDescription, finalUrl, etc., and aiAnalysis will be null.

GET /api/info/{token}

Returns details about your account, including credit usage. Does not consume a credit.

response 200
{
  "customer": {
    "name": "Acme Inc",
    "email": "you@example.com",
    "call_limit": 1000,
    "limit_type": "monthly",
    "calls_used": 38,
    "remaining_calls": 962,
    "last_reset_at": "2026-05-01T00:00:00.000000Z"
  }
}

GET /api/health

Public health check. Useful for status pages and uptime monitors.

response 200
{ "status": "ok", "timestamp": "2026-05-12T11:46:02.928157Z" }

Response format

Successful scrape responses always return success: true with the following fields inside data:

  • url (string) — The URL you submitted.
  • finalUrl (string) — The URL after following redirects.
  • title (string) — Page <title>.
  • metaDescription (string) — Content of the meta description tag, if any.
  • statusCode (integer) — HTTP status code returned by the target site.
  • content (string) — Cleaned, plain-text content of the page.
  • aiAnalysis (object | null) — Structured data extracted by GPT-5, matching the shape of the jsonSchema you sent; null if no schema was provided.
  • timestamp (string) — ISO 8601 timestamp of when the scrape completed.

The top-level response also includes remaining_calls, an integer with your remaining credits after this request.

Error codes

  • 401 Invalid API token — Check the token in your URL. Tokens are case-sensitive.
  • 429 Limit exceeded — You're out of credits. Upgrade your plan or wait for the monthly reset.
  • 502 Backend error — The target site or an upstream service returned an error. The call is refunded.
  • 503 Service unavailable — Transient gateway error. Retry with exponential backoff. The call is refunded.
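Because 502/503 calls are refunded, retrying them costs nothing. A generic backoff sketch in Python — the attempt count and delay schedule are our choices, not API requirements, and `call` stands in for whatever HTTP request you make:

```python
import time
from typing import Callable

RETRYABLE = {502, 503}  # refunded and transient, so safe to retry


def backoff_delays(attempts: int, base: float = 1.0) -> list[float]:
    """Exponential schedule: base, 2*base, 4*base, ..."""
    return [base * (2 ** i) for i in range(attempts)]


def with_retry(call: Callable[[], int], attempts: int = 4,
               base: float = 1.0) -> int:
    """Invoke `call` (returning an HTTP status) until it stops being retryable."""
    status = 503
    for i in range(attempts):
        status = call()
        if status not in RETRYABLE:
            return status  # success, or a non-retryable error like 401/429
        if i < attempts - 1:
            time.sleep(base * (2 ** i))
    return status
```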

Rate limits

There is no per-second rate limit. The only limit is your credit balance.

  • One-time plans grant a fixed bucket of credits that never resets.
  • Monthly plans reset to your full call limit once per calendar month.

Each successful scrape consumes one credit. Failed scrapes (502/503) are automatically refunded.

Individual scrape requests have a backend timeout of roughly 45 seconds. JavaScript-heavy pages that are slow to render may occasionally exceed this and return a 503. Just retry.

Examples

A minimal "hello world" call in each language. Same request, same response.

curl
curl -X POST https://crawlai.io/api/scrape/$CRAWLAI_TOKEN \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://boei.help",
    "jsonSchema": {
      "type": "object",
      "properties": {
        "industry": { "type": "string", "description": "Industry of the company" },
        "country":  { "type": "string", "description": "Country where the company is based" }
      }
    }
  }'

javascript
const schema = {
  type: 'object',
  properties: {
    industry: { type: 'string', description: 'Industry of the company' },
    country:  { type: 'string', description: 'Country where the company is based' },
  },
};

const res = await fetch(`https://crawlai.io/api/scrape/${process.env.CRAWLAI_TOKEN}`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ url: 'https://boei.help', jsonSchema: schema }),
});
const { data, remaining_calls } = await res.json();
console.log(data.aiAnalysis, '|', remaining_calls, 'calls left');

python
import os, requests

token = os.environ["CRAWLAI_TOKEN"]
schema = {
    "type": "object",
    "properties": {
        "industry": {"type": "string", "description": "Industry of the company"},
        "country":  {"type": "string", "description": "Country where the company is based"},
    },
}

r = requests.post(
    f"https://crawlai.io/api/scrape/{token}",
    json={"url": "https://boei.help", "jsonSchema": schema},
    timeout=60,
)
r.raise_for_status()
body = r.json()
print(body["data"]["aiAnalysis"], "|", body["remaining_calls"], "left")

php
$token  = getenv('CRAWLAI_TOKEN');
$schema = [
    'type' => 'object',
    'properties' => [
        'industry' => ['type' => 'string', 'description' => 'Industry of the company'],
        'country'  => ['type' => 'string', 'description' => 'Country where the company is based'],
    ],
];

$res = Http::timeout(60)->post("https://crawlai.io/api/scrape/{$token}", [
    'url'        => 'https://boei.help',
    'jsonSchema' => $schema,
])->json();

print_r($res['data']['aiAnalysis']);
echo $res['remaining_calls'].' left';

Checking remaining credits

curl https://crawlai.io/api/info/YOUR_TOKEN
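Before kicking off a large batch, it can be worth checking that your balance covers it. A sketch using only the Python standard library (the other examples use requests; the helper names here are ours):

```python
import json
from urllib.request import urlopen


def get_account_info(token: str) -> dict:
    """GET /api/info/{token}; does not consume a credit."""
    with urlopen(f"https://crawlai.io/api/info/{token}", timeout=30) as resp:
        return json.load(resp)


def credits_sufficient(info: dict, batch_size: int) -> bool:
    """True if remaining credits cover a planned batch of scrapes."""
    return info["customer"]["remaining_calls"] >= batch_size
```

If credits_sufficient returns False, pause the batch rather than burning through the balance into a 429.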