URL to Markdown: Clean Page Text for LLMs, RAG, and Archives
A URL is not a useful input for a language model. The model wants text. Specifically, it wants the readable body of the page, free of navigation, ads, sidebars, cookie banners, and the other furniture that surrounds the actual content.
Going from a URL to that clean text is the unglamorous prep step behind every RAG pipeline, every "summarise this article" feature, and every long-term web archive. This guide covers how to do it with CrawlAI, what the output actually looks like, and when a different tool is the better choice.
What you actually need
When people say "URL to markdown", they usually mean one of three things.
- Clean body text for an LLM prompt. No formatting required, just the words.
- Structured markdown with headings and links preserved, for human reading or for a static archive.
- Chunked, embedded text for a RAG index. Often closer to plain text than to formatted markdown.
CrawlAI handles cases one and three out of the box. For case two, where the markdown formatting itself matters, a tool that produces markdown natively is a better fit. More on that below.
What CrawlAI returns
CrawlAI's response envelope includes a data.content field. This is the cleaned page text. The service loads the URL in a headless browser, renders JavaScript, strips navigation, scripts, and styles, and returns what is left as readable text.
The output is not formatted markdown. There are no # headings, no [link](url) syntax, no bullet lists. It is the visible content of the page, in reading order, ready to feed to an LLM or chunk into a vector store.
If your downstream system wants plain text, this is exactly the right format. If your downstream system wants polished markdown, you have a small extra step.
A working example
Here is the simplest possible call. One URL in, cleaned content out.
curl -X POST https://crawlai.io/api/scrape/$CRAWLAI_TOKEN \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/blog/post-slug",
    "selector": "main"
  }'
The response.
{
  "success": true,
  "data": {
    "title": "Why we rewrote our pipeline in Rust",
    "finalUrl": "https://example.com/blog/post-slug",
    "statusCode": 200,
    "metaDescription": "A retrospective on a six month migration project.",
    "content": "Why we rewrote our pipeline in Rust. Last spring we decided...",
    "aiAnalysis": null
  },
  "remaining_calls": 999
}
A few things worth noting.
- The selector field narrows the cleaner to the part of the page you care about: main, article, or a tighter CSS path. The default is body, which works for most pages but pulls in more chrome.
- aiAnalysis is null because we did not supply a jsonSchema. The endpoint is happy to return just the cleaned content if that is all you need.
- title and metaDescription come straight from the page's <title> tag and meta tags. These are excellent document metadata for a RAG index.
The full envelope, with every field and every error code, is documented in the docs.
When CrawlAI is the right tool
CrawlAI fits well for these jobs.
- Feeding a known URL into an LLM. You have a URL, you want the page text in a prompt window. One call, ready in seconds.
- Building a RAG index from a curated URL list. You know which pages you want. You loop, you scrape, you embed, you store.
- Mixed jobs. You want clean content AND structured fields from the same page. Pass a jsonSchema and you get both in one response.
The single-URL design is the point. There is no /crawl endpoint, no link discovery, no map. If you bring the URL list, CrawlAI returns clean content and optional structured JSON, fast.
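Here is a sketch of the mixed case. The schema fields (an author and a date) are illustrative assumptions, not a documented contract; the docs cover the exact jsonSchema semantics.

// Sketch: clean content and structured fields from one call.
// The schema below is an illustration; check the docs for the
// exact jsonSchema shape the endpoint expects.
const res = await fetch(`https://crawlai.io/api/scrape/${token}`, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    url: "https://example.com/blog/post-slug",
    selector: "main",
    jsonSchema: {
      type: "object",
      properties: {
        author: { type: "string" },
        publishDate: { type: "string" }
      }
    }
  })
}).then(r => r.json());

// res.data.content holds the clean text; res.data.aiAnalysis holds
// the structured fields (null when no schema is supplied).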
When something else is the right tool
Be honest about the alternatives.
If you specifically want markdown-formatted output, where headings remain as headings and links remain as links, Firecrawl ships markdown as a default output format. CrawlAI returns plain text, which is closer to what an LLM prompt wants but further from what a human reader wants. The Firecrawl alternative page covers the head-to-head in detail.
If you need to discover URLs as well as fetch them, neither tool is a one-stop fit unless you use Firecrawl's /crawl endpoint. CrawlAI's design assumes you bring the URLs. That assumption is wrong for some workflows (full-site documentation ingestion, for example) and right for others (lead enrichment, monitored sources, partner feeds).
If you want a self-hosted option, CrawlAI is not the right fit: it is hosted only. Firecrawl is open source. For an open-source Python route specifically, the Crawl4AI vs CrawlAI post covers that path.
Pick the tool whose default scope matches your problem.
Feeding the output to an LLM
The most common downstream use of clean URL content is a prompt. Here is the pattern in pseudocode.
// Fetch the cleaned page text for a single URL.
const scrape = await fetch(`https://crawlai.io/api/scrape/${token}`, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    url: "https://example.com/article",
    selector: "article"
  })
}).then(r => r.json());

// Drop the cleaned content into the prompt; title and finalUrl give
// the model its framing metadata.
const prompt = `
Summarise the following article in three bullet points.
TITLE: ${scrape.data.title}
URL: ${scrape.data.finalUrl}
CONTENT:
${scrape.data.content}
`;

// callYourLLM is a stand-in for whichever model client you use.
const summary = await callYourLLM(prompt);
That is the entire pipeline. Fetch, format, prompt. The cleaned content is short enough to fit in a modern context window for most pages, and the title and URL give the model the framing metadata it needs.
For a structured version of the same pattern (extracting specific fields instead of summarising), the GPT-5 extraction tutorial walks through schemas for articles, products, and contact pages.
Building a RAG index
The other big use case is feeding many pages into a vector store. The shape is the same as for structured extraction. You hold the URL list, you call the API, you process the result.
A typical loop.
- For each URL in your list, call POST /api/scrape/{token} with selector: "main" (or whatever narrows to the article body on the source).
- Take data.content and chunk it into 500 to 1500 token segments.
- Embed each chunk.
- Store the chunks alongside data.title, data.finalUrl, and data.metaDescription as metadata.
- At query time, retrieve top-K chunks and pass them to the LLM with the user question.
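In code, the loop might look like the sketch below. embedText and vectorStore are hypothetical stand-ins for your embedding model and vector database, and the character-based chunker is a rough proxy for real token counting.

// Sketch of the RAG ingestion loop. embedText and vectorStore are
// hypothetical stand-ins; swap in your embedding model and store.
const urls = ["https://example.com/a", "https://example.com/b"];

for (const url of urls) {
  const { data } = await fetch(`https://crawlai.io/api/scrape/${token}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ url, selector: "main" })
  }).then(r => r.json());

  // Rough character-based chunking (~4 chars per token); use a real
  // tokenizer to hit the 500 to 1500 token target precisely.
  const chunks = data.content.match(/[\s\S]{1,4000}/g) ?? [];

  for (const chunk of chunks) {
    await vectorStore.upsert({
      vector: await embedText(chunk), // hypothetical embedding helper
      text: chunk,
      metadata: {
        title: data.title,
        url: data.finalUrl,
        description: data.metaDescription
      }
    });
  }
}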
The URL to LLM context guide covers this pattern with chunking strategies and metadata schemas in more detail.
Plain text vs markdown: does it matter?
For LLMs, mostly no. A modern model reads plain text fine, and the small amount of structure markdown adds (headings, lists) does not change retrieval quality in practice. If you are feeding content into a prompt, plain text is fine.
For humans and for archives, yes. Polished markdown is easier to read, easier to diff, and easier to convert to other formats later. If you are building a personal read-later archive or a documentation mirror, markdown formatting pulls its weight.
Here is the same idea as a quick comparison.
| Use case | Plain text (CrawlAI) | Markdown (Firecrawl) |
|---|---|---|
| LLM prompt context | Fits naturally | Also fine, slightly more tokens |
| RAG embedding | Ideal, less noise per chunk | Works, headings sometimes help retrieval |
| Human-readable archive | Loses structure | Preserves structure |
| Static site mirror | Awkward | Natural |
| Mixed structured + text output | One call returns both | Two calls or one heavier call |
There is no universal winner. Pick the format that matches the destination.
What about converting plain text to markdown yourself?
If you are mostly happy with CrawlAI's output but want markdown for a specific job, an LLM can reformat plain content into clean markdown in one extra call. Pass the data.content to GPT-5 with a prompt like "reformat the following article as clean markdown with appropriate headings and lists, preserving the original wording", and you get a respectable result.
This is fine for one-off jobs. For systematic markdown pipelines, a tool that produces markdown natively saves the extra round trip.
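A sketch of that extra call, reusing the callYourLLM stand-in from the prompt example above:

// One extra call turns plain content into markdown.
const markdown = await callYourLLM(`
Reformat the following article as clean markdown with appropriate
headings and lists, preserving the original wording.

${scrape.data.content}
`);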
Where to go from here
If your job is "I have URLs, I want clean text for an LLM or a RAG index", CrawlAI is a small, predictable answer. One endpoint, three fields, structured envelope, ready to chain into the rest of your pipeline.
If your job leans heavily on the markdown format itself, the Firecrawl alternative page is the honest comparison.
For the broader story of how LLM-driven web extraction changes scraping pipelines, the main AI scraping guide covers the strategic picture. For the structured-extraction sibling of this post, the HTML to JSON guide walks through schema design.
The full API contract, including every response field and language examples in cURL, JavaScript, Python, and PHP, lives in the documentation.
Try CrawlAI
Turn any URL into structured JSON with your own schema, powered by GPT-5. Pay-as-you-go starts at $10.