URL to LLM Context: Building RAG Pipelines from Web Pages
Retrieval-augmented generation lives or dies on the quality of the context you feed it. Garbage in, confidently wrong out. If your RAG app pulls from a knowledge base sourced from web pages, the ingestion step matters more than the prompt.
This post walks through turning a URL into clean, chunkable, embeddable context. It covers the use cases that benefit most, how to get clean content from CrawlAI, how to chunk and embed it, and how it compares to alternatives like Firecrawl for the same job.
Why ingest URLs at all
A lot of useful knowledge lives on the public web. Product documentation. Pricing pages. Help centres. Vendor data sheets. Regulatory filings. Personal blog posts your sales team keeps quoting at customers.
Three common RAG workflows depend on URL ingestion:
- Company knowledge bases. Internal search across your own marketing site, blog, and docs, plus selected external references. Employees ask questions, the system answers with cited pages.
- Customer support. A bot grounded in your help centre and product docs. When the docs change, ingestion runs and the bot stays current without retraining.
- In-context search and analysis. Sales tools that pull a prospect's careers page, pricing, and recent blog posts and feed the highlights into a summarisation prompt.
The pattern is the same across all of them. Fetch the URL, get clean content, chunk it, embed each chunk, store it. At query time, retrieve the most relevant chunks and pass them to the model.
The fetch step is where most pipelines get sloppy.
What "clean content" actually requires
A naive fetch gets you a noisy HTML blob full of:
- Navigation menus and footers, repeated on every page.
- Cookie banners and consent overlays.
- Sidebar widgets, related posts, comment threads.
- Ad scripts, tracking pixels, hidden SEO text.
- JavaScript-rendered placeholders if the site is a single-page app.
If you embed that blob, every chunk gets contaminated. Retrieval starts surfacing footer text and cookie disclaimers instead of the article body. Answers go sideways.
Clean content means: the main article or document body, in plain text or markdown, with the boilerplate removed and the JavaScript already executed.
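To make "boilerplate removed" concrete, here is a stdlib-only sketch that keeps text inside `<main>` while dropping `<nav>`, `<footer>`, and scripts. It is a toy: a hosted scraper does this (plus JavaScript rendering) far more robustly, and the tag list here is an illustrative assumption, not a complete one.

```python
from html.parser import HTMLParser


class MainTextExtractor(HTMLParser):
    """Collect text inside <main>, skipping nav, footer, script, and style."""

    SKIP = {"nav", "footer", "script", "style"}

    def __init__(self):
        super().__init__()
        self.in_main = 0      # depth counter for <main> nesting
        self.skip_depth = 0   # depth counter for boilerplate tags
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "main":
            self.in_main += 1
        elif tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "main":
            self.in_main = max(0, self.in_main - 1)
        elif tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)

    def handle_data(self, data):
        # Keep text only when inside <main> and outside any skipped tag.
        if self.in_main and not self.skip_depth and data.strip():
            self.parts.append(data.strip())


def clean_text(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    return "\n".join(parser.parts)
```

Even this crude version keeps cookie banners and navigation links out of the chunks that reach your embedder.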
Getting clean content from CrawlAI
CrawlAI returns the cleaned page text in `data.content` on every request. You also get a structured `aiAnalysis` object if you supplied a JSON schema. For RAG ingestion you mostly care about `content`.
A minimal request:
curl -X POST https://crawlai.io/api/scrape/$CRAWLAI_TOKEN \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/docs/getting-started",
    "selector": "main",
    "jsonSchema": {
      "type": "object",
      "properties": {
        "title": { "type": "string", "description": "Page title" },
        "summary": { "type": "string", "description": "One-sentence summary of the page" }
      }
    }
  }'
The response carries the fields you need for ingestion:
{
  "success": true,
  "data": {
    "title": "Getting Started",
    "finalUrl": "https://example.com/docs/getting-started",
    "statusCode": 200,
    "metaDescription": "How to set up your first project in under five minutes.",
    "content": "Getting Started\nThis guide walks you through...",
    "aiAnalysis": {
      "title": "Getting Started",
      "summary": "Walkthrough for setting up a first project in under five minutes."
    }
  },
  "remaining_calls": 999
}
The `selector` parameter is doing real work here. Pointing it at `main` (or `article`, or a more specific selector if you have one) trims the navigation, footer, and sidebar before extraction. Cleaner input means cleaner chunks and better retrieval. If you do not know the selector, leave it as `body` and let the AI extraction do more of the cleanup.
For pages that need JavaScript rendering, you get it for free. CrawlAI runs a headless browser internally, so single-page apps work without extra configuration. The headless browser scraping post covers why this matters and what the alternatives look like.
Honest comparison with Firecrawl for RAG
If your goal is RAG and nothing else, Firecrawl has a real advantage: it returns polished markdown by default. Markdown preserves headings, lists, code blocks, and links, all of which help downstream chunkers split content along semantic boundaries.
CrawlAI returns plain text. It is clean text (whitespace normalised, scripts stripped, JavaScript rendered) but it does not preserve heading levels as `#` markers. For chunkers that rely on markdown structure, that is a real gap.
Trade-offs to weigh:
- If markdown is non-negotiable and you do not need the strict per-page JSON schema extraction, Firecrawl is the more natural fit. See the Firecrawl comparison for the full breakdown.
- If you need both clean content and structured fields (title, summary, category, key entities) from the same URL, CrawlAI's combination of `data.content` plus `data.aiAnalysis` is hard to beat.
- If markdown specifically is what you want from a URL without the RAG pipeline around it, the url-to-markdown post covers the options, including a Firecrawl path and a CrawlAI-plus-post-processing path.
A reasonable hybrid is to use CrawlAI for URLs where you also need structured extraction (lead enrichment, classification, content tagging) and use Firecrawl where pure markdown for RAG is the entire job. There is no rule that says you have to pick one vendor.
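If you take the CrawlAI-plus-post-processing path, one admittedly crude heuristic is to promote short, punctuation-free lines back to markdown headings. This is an illustrative sketch, not part of either API, and the length threshold is an assumption you would tune per site:

```python
def restore_headings(text: str, max_len: int = 60) -> str:
    """Heuristically promote short, punctuation-free lines to markdown headings.

    A rough stand-in for structure lost in plain-text extraction: short lines
    that do not end in sentence punctuation are treated as section titles.
    """
    out = []
    for line in text.split("\n"):
        stripped = line.strip()
        if (
            stripped
            and len(stripped) <= max_len
            and not stripped.endswith((".", ",", ":", ";", "?", "!"))
        ):
            out.append(f"## {stripped}")
        else:
            out.append(line)
    return "\n".join(out)
```

It will misfire on short closing sentences, but for docs-style pages it recovers enough structure for a markdown-aware chunker to split on.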
A RAG ingestion pipeline outline
The shape of a working pipeline:
- Maintain a list of URLs (a sitemap, a queue, a CSV).
- For each URL, call CrawlAI and pull `data.content`, `data.title`, and `data.metaDescription`.
- Chunk the content into 500 to 1,500 token windows with 50 to 200 token overlap.
- Embed each chunk with your model of choice (OpenAI `text-embedding-3-small`, Cohere, a self-hosted model).
- Store chunk text, embedding, and source URL in a vector store (pgvector, Qdrant, Pinecone, Weaviate).
- At query time, embed the user question, retrieve top-k chunks, build a prompt with the chunks plus the question, and send it to GPT-5.
A minimal Python sketch:
import os

import requests
import tiktoken
from openai import OpenAI

CRAWLAI_TOKEN = os.environ["CRAWLAI_TOKEN"]
client = OpenAI()
# get_encoding avoids a KeyError on tiktoken versions with no "gpt-5" mapping;
# o200k_base is the encoding family used by recent OpenAI models.
encoder = tiktoken.get_encoding("o200k_base")


def fetch_content(url: str) -> dict:
    r = requests.post(
        f"https://crawlai.io/api/scrape/{CRAWLAI_TOKEN}",
        json={
            "url": url,
            "selector": "main",
            "jsonSchema": {
                "type": "object",
                "properties": {
                    "title": {"type": "string", "description": "Page title"},
                    "summary": {"type": "string", "description": "One-sentence summary"},
                },
            },
        },
        timeout=60,
    )
    r.raise_for_status()
    return r.json()["data"]


def chunk(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    tokens = encoder.encode(text)
    out = []
    i = 0
    while i < len(tokens):
        window = tokens[i : i + size]
        out.append(encoder.decode(window))
        i += size - overlap
    return out


def embed(texts: list[str]) -> list[list[float]]:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [d.embedding for d in resp.data]


def ingest(url: str, store):
    data = fetch_content(url)
    chunks = chunk(data["content"])
    embeddings = embed(chunks)
    for text, vector in zip(chunks, embeddings):
        store.add(
            text=text,
            embedding=vector,
            metadata={
                "url": data["finalUrl"],
                "title": data["title"],
                "summary": data["aiAnalysis"].get("summary"),
            },
        )
The `store.add` call is whatever your vector database wants. The rest is portable.
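For local experiments before committing to a vector database, a toy in-memory store with the same `add`/`search` shape is enough. This is a sketch, not a real store: brute-force cosine similarity over a Python list, with metadata flattened onto each hit so results carry their source URL and title.

```python
import math


class InMemoryStore:
    """Toy stand-in for a vector database: brute-force cosine similarity."""

    def __init__(self):
        self.rows = []

    def add(self, text, embedding, metadata):
        # Flatten metadata so hits expose url/title/text directly.
        self.rows.append({"text": text, "embedding": embedding, **metadata})

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    def search(self, query_embedding, top_k=5):
        ranked = sorted(
            self.rows,
            key=lambda row: self._cosine(query_embedding, row["embedding"]),
            reverse=True,
        )
        return ranked[:top_k]
```

Linear scan is fine for a few thousand chunks; beyond that, you want the approximate-nearest-neighbour indexes that pgvector, Qdrant, and friends provide.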
At query time:
def answer(question: str, store) -> str:
    q_emb = embed([question])[0]
    hits = store.search(q_emb, top_k=5)
    context = "\n\n".join(f"[{h['title']}]({h['url']})\n{h['text']}" for h in hits)
    prompt = f"Use the context below to answer.\n\nContext:\n{context}\n\nQuestion: {question}"
    resp = client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
This is intentionally bare. Real systems add reranking, query rewriting, citation formatting, and freshness checks. The bones do not change.
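One of those additions is cheap enough to sketch here. A lexical reranker reorders vector hits by word overlap with the question, which catches cases where embeddings surface topically close but irrelevant chunks. This is a stand-in for a real cross-encoder reranker, not a substitute for one:

```python
def keyword_rerank(question: str, hits: list[dict], top_k: int = 3) -> list[dict]:
    """Cheap lexical rerank: order vector hits by word overlap with the question."""
    q_words = set(question.lower().split())

    def overlap(hit: dict) -> int:
        return len(q_words & set(hit["text"].lower().split()))

    return sorted(hits, key=overlap, reverse=True)[:top_k]
```

Retrieve more candidates than you need (say, top 20 by embedding) and let the reranker pick the final handful that actually go into the prompt.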
Practical chunking tips
A few things that matter more than people expect:
- Chunk on paragraphs first, then merge to size. Splitting mid-sentence destroys retrieval quality. Split on `\n\n`, then greedily combine paragraphs until you reach your token budget.
- Preserve the page title in every chunk's metadata. When the model writes citations, you want a human-readable source, not a UUID.
- Deduplicate aggressively. Many sites repeat the same boilerplate (cookie text, footer disclaimers) across hundreds of pages. Hash chunks and drop duplicates before embedding to save cost and improve retrieval.
- Re-ingest on a schedule. Marketing pages change. Pricing changes. Docs get rewritten. A weekly or daily refresh keeps your index aligned with reality. CrawlAI is stateless, so re-ingestion is just running the same script again.
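The first and third tips can be sketched together. This version approximates tokens by word counts to stay dependency-free (an assumption for the sketch; swap in a real tokenizer for production) and deduplicates by content hash before anything reaches the embedder:

```python
import hashlib


def paragraph_chunks(text: str, budget: int = 800) -> list[str]:
    """Split on blank lines, then greedily merge paragraphs up to a budget.

    The budget is in words here as a cheap proxy for tokens.
    """
    chunks, current, count = [], [], 0
    for para in (p.strip() for p in text.split("\n\n") if p.strip()):
        words = len(para.split())
        if current and count + words > budget:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def dedupe(chunks: list[str]) -> list[str]:
    """Drop exact-duplicate chunks (repeated footers, cookie text) by hash."""
    seen, out = set(), []
    for c in chunks:
        h = hashlib.sha256(c.encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            out.append(c)
    return out
```

Because paragraphs are merged rather than sliced, no chunk ever starts or ends mid-sentence, and the hash pass means a footer repeated across five hundred pages gets embedded exactly once.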
Where to go next
If you want a wider view of the schema-driven approach behind `data.aiAnalysis`, the main guide covers the philosophy and trade-offs. If markdown specifically is what you need, the url-to-markdown post compares options including Firecrawl. The Firecrawl alternative comparison covers the broader product fit between the two tools.
To see the full API contract (every field, every error code, language examples), the documentation is the source of truth.
Turning a URL into clean LLM context is not glamorous work. It is the difference between a RAG app that quotes your docs and one that quotes your footer. Spend the time on ingestion. The prompt engineering you save later is more than worth it.
Try CrawlAI
Turn any URL into structured JSON with your own schema, powered by GPT-5. Pay-as-you-go starts at $10.