URL to LLM Context: Building RAG Pipelines from Web Pages

Retrieval-augmented generation lives or dies on the quality of the context you feed it. Garbage in, confidently wrong out. If your RAG app pulls from a knowledge base sourced from web pages, the ingestion step matters more than the prompt.

This post walks through turning a URL into clean, chunkable, embeddable context. It covers the use cases that benefit most, how to get clean content from CrawlAI, how to chunk and embed it, and how CrawlAI compares to alternatives like Firecrawl for the same job.

Why ingest URLs at all

A lot of useful knowledge lives on the public web. Product documentation. Pricing pages. Help centres. Vendor data sheets. Regulatory filings. Personal blog posts your sales team keeps quoting at customers.

Three common RAG workflows depend on URL ingestion:

  1. A support bot that answers questions from your product documentation and help centre.
  2. An internal assistant that searches vendor data sheets and pricing pages.
  3. A research tool that digests regulatory filings and industry blog posts.

The pattern is the same across all of them. Fetch the URL, get clean content, chunk it, embed each chunk, store it. At query time, retrieve the most relevant chunks and pass them to the model.

The fetch step is where most pipelines get sloppy.

What "clean content" actually requires

A naive fetch gets you a noisy HTML blob full of navigation menus, sidebars, footer links, cookie banners, script and style tags, and empty placeholders where JavaScript was supposed to render content.

If you embed that blob, every chunk gets contaminated. Retrieval starts surfacing footer text and cookie disclaimers instead of the article body. Answers go sideways.

Clean content means: the main article or document body, in plain text or markdown, with the boilerplate removed and the JavaScript already executed.
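
For contrast, here is roughly what the naive version looks like. This sketch assumes requests and BeautifulSoup (neither is part of the pipeline in this post), and it is the approach the rest of this post argues against:

import requests
from bs4 import BeautifulSoup

# A naive fetch: no JavaScript execution, no boilerplate removal.
html = requests.get("https://example.com/docs/getting-started", timeout=30).text
text = BeautifulSoup(html, "html.parser").get_text(separator="\n")

# `text` still interleaves nav links, cookie banners, and footer legalese
# with the article body, and anything rendered client-side is missing.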

Getting clean content from CrawlAI

CrawlAI returns the cleaned page text in data.content on every request. You also get a structured aiAnalysis object if you supplied a JSON schema. For RAG ingestion you mostly care about content.

A minimal request:

curl -X POST https://crawlai.io/api/scrape/$CRAWLAI_TOKEN \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/docs/getting-started",
    "selector": "main",
    "jsonSchema": {
      "type": "object",
      "properties": {
        "title":   { "type": "string", "description": "Page title" },
        "summary": { "type": "string", "description": "One-sentence summary of the page" }
      }
    }
  }'

The response carries the fields you need for ingestion:

{
  "success": true,
  "data": {
    "title": "Getting Started",
    "finalUrl": "https://example.com/docs/getting-started",
    "statusCode": 200,
    "metaDescription": "How to set up your first project in under five minutes.",
    "content": "Getting Started\nThis guide walks you through...",
    "aiAnalysis": {
      "title": "Getting Started",
      "summary": "Walkthrough for setting up a first project in under five minutes."
    }
  },
  "remaining_calls": 999
}

The selector parameter is doing real work here. Pointing it at main (or article, or a more specific selector if you have one) trims the navigation, footer, and sidebar before extraction. Cleaner input means cleaner chunks and better retrieval. If you do not know the selector, leave it as body and let the AI extraction do more of the cleanup.

For pages that need JavaScript rendering, you get it for free. CrawlAI runs a headless browser internally, so single-page apps work without extra configuration. The headless browser scraping post covers why this matters and what the alternatives look like.

Honest comparison with Firecrawl for RAG

If your goal is RAG and nothing else, Firecrawl has a real advantage: it returns polished markdown by default. Markdown preserves headings, lists, code blocks, and links, all of which help downstream chunkers split content along semantic boundaries.

CrawlAI returns plain text. It is clean text (whitespace normalised, scripts stripped, JavaScript rendered) but it does not preserve heading levels as # markers. For chunkers that rely on markdown structure, that is a real gap.
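
To make that gap concrete, a markdown-aware chunker can split on heading boundaries with nothing more than a regex. This is an illustrative sketch, not any particular library's API:

import re

def split_on_headings(markdown: str) -> list[str]:
    # Split immediately before every markdown heading, so each section
    # becomes its own candidate chunk. Plain text has no such markers,
    # which forces fixed-size token windows instead.
    parts = re.split(r"\n(?=#{1,6} )", markdown)
    return [p.strip() for p in parts if p.strip()]

sections = split_on_headings("# Getting Started\nIntro...\n## Install\nSteps...")
# -> ["# Getting Started\nIntro...", "## Install\nSteps..."]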

Trade-offs to weigh:

  1. Output format. Firecrawl's markdown preserves headings for structure-aware chunking; CrawlAI's plain text pushes you towards fixed-size token windows.
  2. Structured extraction. CrawlAI returns content and a schema-shaped aiAnalysis object in one call, which matters if ingestion doubles as enrichment.
  3. Cleanup control. CrawlAI's selector parameter trims navigation and footers before extraction, which reduces chunk contamination at the source.

A reasonable hybrid is to use CrawlAI for URLs where you also need structured extraction (lead enrichment, classification, content tagging) and use Firecrawl where pure markdown for RAG is the entire job. There is no rule that says you have to pick one vendor.

A RAG ingestion pipeline outline

The shape of a working pipeline:

  1. Maintain a list of URLs (a sitemap, a queue, a CSV).
  2. For each URL, call CrawlAI and pull data.content, data.title, and data.metaDescription.
  3. Chunk the content into 500 to 1,500 token windows with 50 to 200 token overlap.
  4. Embed each chunk with your model of choice (OpenAI text-embedding-3-small, Cohere, a self-hosted model).
  5. Store chunk text, embedding, and source URL in a vector store (pgvector, Qdrant, Pinecone, Weaviate).
  6. At query time, embed the user question, retrieve top-k chunks, build a prompt with the chunks plus the question, and send it to GPT-5.

A minimal Python sketch:

import os
import requests
import tiktoken
from openai import OpenAI

CRAWLAI_TOKEN = os.environ["CRAWLAI_TOKEN"]
client = OpenAI()
encoder = tiktoken.get_encoding("o200k_base")  # encoding_for_model() may not recognise "gpt-5"

def fetch_content(url: str) -> dict:
    r = requests.post(
        f"https://crawlai.io/api/scrape/{CRAWLAI_TOKEN}",
        json={
            "url": url,
            "selector": "main",
            "jsonSchema": {
                "type": "object",
                "properties": {
                    "title":   {"type": "string", "description": "Page title"},
                    "summary": {"type": "string", "description": "One-sentence summary"},
                },
            },
        },
        timeout=60,
    )
    r.raise_for_status()
    return r.json()["data"]

def chunk(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    # Sliding token window: each chunk shares `overlap` tokens with its
    # predecessor so sentences that straddle a boundary stay retrievable.
    tokens = encoder.encode(text)
    out = []
    i = 0
    while i < len(tokens):
        window = tokens[i : i + size]
        out.append(encoder.decode(window))
        i += size - overlap
    return out

def embed(texts: list[str]) -> list[list[float]]:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [d.embedding for d in resp.data]

def ingest(url: str, store):
    data = fetch_content(url)
    chunks = chunk(data["content"])
    embeddings = embed(chunks)
    for text, vector in zip(chunks, embeddings):
        store.add(
            text=text,
            embedding=vector,
            metadata={
                "url": data["finalUrl"],
                "title": data["title"],
                "summary": data["aiAnalysis"].get("summary"),
            },
        )

The store.add call is whatever your vector database wants. The rest is portable.
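
If you want to run the sketch end to end before committing to a database, a brute-force in-memory stand-in is enough. This InMemoryStore is an illustration, not a real vector store:

class InMemoryStore:
    # Brute-force cosine similarity over a list of dicts. Fine for a demo;
    # swap in pgvector, Qdrant, Pinecone, or Weaviate for real workloads.
    def __init__(self):
        self.rows = []

    def add(self, text, embedding, metadata):
        self.rows.append({"text": text, "embedding": embedding, **metadata})

    def search(self, query_embedding, top_k=5):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
            return dot / norm
        return sorted(
            self.rows,
            key=lambda r: cosine(r["embedding"], query_embedding),
            reverse=True,
        )[:top_k]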

At query time:

def answer(question: str, store) -> str:
    q_emb = embed([question])[0]
    hits = store.search(q_emb, top_k=5)
    context = "\n\n".join(f"[{h['title']}]({h['url']})\n{h['text']}" for h in hits)
    prompt = f"Use the context below to answer.\n\nContext:\n{context}\n\nQuestion: {question}"
    resp = client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
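
Wiring the pieces together with the in-memory store above (the URL and question are placeholders):

store = InMemoryStore()
ingest("https://example.com/docs/getting-started", store)
print(answer("How do I set up my first project?", store))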

This is intentionally bare. Real systems add reranking, query rewriting, citation formatting, and freshness checks. The bones do not change.

Practical chunking tips

A few things that matter more than people expect:

  1. Overlap is not optional. The 50 to 200 token overlap keeps sentences that straddle a boundary retrievable from at least one chunk.
  2. Carry metadata. Storing the source URL and title with every chunk is what lets answers cite their sources.
  3. Give each chunk context. A window that starts mid-paragraph embeds poorly on its own; prefixing the page title helps, as the snippet below shows.
  4. Stay in the useful size range. Below roughly 500 tokens chunks lose meaning; above 1,500 retrieval gets blunt.
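
The title-prefix tip is a one-line change to the ingest function above. A hypothetical tweak:

# Inside ingest(): prefix each chunk with the page title so the
# embedding carries document identity, not just the local window.
chunks = [f"{data['title']}\n\n{c}" for c in chunk(data["content"])]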

Where to go next

If you want a wider view of the schema-driven approach behind data.aiAnalysis, the main guide covers the philosophy and trade-offs. If markdown specifically is what you need, the url-to-markdown post compares options including Firecrawl. The Firecrawl alternative comparison covers the broader product fit between the two tools.

To see the full API contract (every field, every error code, language examples), the documentation is the source of truth.

Turning a URL into clean LLM context is not glamorous work. It is the difference between a RAG app that quotes your docs and one that quotes your footer. Spend the time on ingestion. The prompt engineering you save later is more than worth it.

Try CrawlAI

Turn any URL into structured JSON with your own schema, powered by GPT-5. Pay-as-you-go starts at $10.