URL to Markdown: Clean Page Text for LLMs, RAG, and Archives

A URL is not a useful input for a language model. The model wants text. Specifically, it wants the readable body of the page, free of navigation, ads, sidebars, cookie banners, and the other furniture that surrounds the actual content.

Going from a URL to that clean text is the unglamorous prep step behind every RAG pipeline, every "summarise this article" feature, and every long-term web archive. This guide covers how to do it with CrawlAI, what the output actually looks like, and when a different tool is the better choice.

What you actually need

When people say "URL to markdown", they usually mean one of three things.

  1. Clean body text for an LLM prompt. No formatting required, just the words.
  2. Structured markdown with headings and links preserved, for human reading or for a static archive.
  3. Chunked, embedded text for a RAG index. Often closer to plain text than to formatted markdown.

CrawlAI handles cases one and three out of the box. For case two, where the markdown formatting itself matters, a tool that produces markdown natively is a better fit. More on that below.

What CrawlAI returns

CrawlAI's response envelope includes a data.content field. This is the cleaned page text. The service loads the URL in a headless browser, renders JavaScript, strips navigation, scripts, and styles, and returns what is left as readable text.

The output is not formatted markdown. There are no # headings, no [link](url) syntax, no bullet lists. It is the visible content of the page, in reading order, ready to feed to an LLM or chunk into a vector store.

If your downstream system wants plain text, this is exactly the right format. If your downstream system wants polished markdown, you have a small extra step.

A working example

Here is the simplest possible call. One URL in, cleaned content out.

curl -X POST https://crawlai.io/api/scrape/$CRAWLAI_TOKEN \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/blog/post-slug",
    "selector": "main"
  }'

The response.

{
  "success": true,
  "data": {
    "title": "Why we rewrote our pipeline in Rust",
    "finalUrl": "https://example.com/blog/post-slug",
    "statusCode": 200,
    "metaDescription": "A retrospective on a six month migration project.",
    "content": "Why we rewrote our pipeline in Rust. Last spring we decided...",
    "aiAnalysis": null
  },
  "remaining_calls": 999
}

A few things worth noting: content is the cleaned text, finalUrl reflects where the request actually landed after any redirects, and statusCode lets you filter out failed fetches before they reach your pipeline. The full envelope, with every field and every error code, is documented in the docs.
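Before touching data, it pays to check the envelope. A minimal guard, sketched against the example response above (the field names match that sample; extractContent is a helper name invented here, and error shapes beyond success: false are an assumption):

```javascript
// Hypothetical sample envelope, shaped like the example response above.
const response = {
  success: true,
  data: {
    title: "Why we rewrote our pipeline in Rust",
    finalUrl: "https://example.com/blog/post-slug",
    statusCode: 200,
    content: "Why we rewrote our pipeline in Rust. Last spring we decided...",
  },
  remaining_calls: 999,
};

// Pull out the cleaned text, or fail loudly with whatever context exists.
function extractContent(envelope) {
  if (!envelope.success) {
    throw new Error(`Scrape failed: ${JSON.stringify(envelope)}`);
  }
  const { content, title, finalUrl } = envelope.data;
  return { content, title, finalUrl };
}

const page = extractContent(response);
```

The same guard sits naturally in front of every downstream step in this guide.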

When CrawlAI is the right tool

CrawlAI fits well when you already hold the URLs: RAG ingestion, article summarisation, lead enrichment, monitoring known sources.

The single-URL design is the point. There is no /crawl endpoint, no link discovery, no map. If you bring the URL list, CrawlAI returns clean content and optional structured JSON, fast.

When something else is the right tool

Be honest about the alternatives.

If you specifically want markdown-formatted output, where headings remain as headings and links remain as links, Firecrawl ships markdown as a default output format. CrawlAI returns plain text, which is closer to what an LLM prompt wants but further from what a human reader wants. The Firecrawl alternative page covers the head-to-head in detail.

If you need to discover URLs as well as fetch them, neither tool is a one-stop fit unless you use Firecrawl's /crawl endpoint. CrawlAI's design assumes you bring the URLs. That assumption is wrong for some workflows (full-site documentation ingestion, for example) and right for others (lead enrichment, monitored sources, partner feeds).

If you want a self-hosted option, CrawlAI is not it: the service is hosted only. Firecrawl is open source. For an open-source Python route specifically, the Crawl4AI vs CrawlAI post covers that path.

Pick the tool whose default scope matches your problem.

Feeding the output to an LLM

The most common downstream use of clean URL content is a prompt. Here is the pattern in pseudocode.

const scrape = await fetch(`https://crawlai.io/api/scrape/${token}`, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    url: "https://example.com/article",
    selector: "article"
  })
}).then(r => r.json());

const prompt = `
Summarise the following article in three bullet points.

TITLE: ${scrape.data.title}
URL: ${scrape.data.finalUrl}

CONTENT:
${scrape.data.content}
`;

const summary = await callYourLLM(prompt);

That is the entire pipeline. Fetch, format, prompt. The cleaned content is short enough to fit in a modern context window for most pages, and the title and URL give the model the framing metadata it needs.
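For pages that do overflow the window, the format step needs a budget. A hedged sketch, with a crude character count standing in for a real token count (the 4-characters-per-token ratio is a rough heuristic for English prose, and buildPrompt is a name invented here):

```javascript
// Rough heuristic: ~4 characters per token for English prose.
const MAX_CONTENT_TOKENS = 6000;

function buildPrompt({ title, finalUrl, content }) {
  const budget = MAX_CONTENT_TOKENS * 4;
  const trimmed =
    content.length > budget
      ? content.slice(0, budget) + "\n[truncated]"
      : content;
  return [
    "Summarise the following article in three bullet points.",
    "",
    `TITLE: ${title}`,
    `URL: ${finalUrl}`,
    "",
    "CONTENT:",
    trimmed,
  ].join("\n");
}

const prompt = buildPrompt({
  title: "Why we rewrote our pipeline in Rust",
  finalUrl: "https://example.com/blog/post-slug",
  content: "Last spring we decided...",
});
```

A real pipeline would swap the heuristic for a proper tokenizer, but the shape stays the same.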

For a structured version of the same pattern (extracting specific fields instead of summarising), the GPT-5 extraction tutorial walks through schemas for articles, products, and contact pages.

Building a RAG index

The other big use case is feeding many pages into a vector store. The shape is the same as for structured extraction. You hold the URL list, you call the API, you process the result.

A typical loop.

  1. For each URL in your list, call POST /api/scrape/{token} with selector: "main" (or whatever narrows to the article body on the source).
  2. Take data.content, chunk it into 500 to 1500 token segments.
  3. Embed each chunk.
  4. Store the chunks alongside data.title, data.finalUrl, and data.metaDescription as metadata.
  5. At query time, retrieve top-K chunks and pass them to the LLM with the user question.
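Step two is the only part of that loop with any real logic. A minimal chunker, assuming the same rough 4-characters-per-token approximation (a production pipeline would use a real tokenizer and overlapping chunks):

```javascript
// Split cleaned text into segments of roughly `maxTokens` tokens,
// breaking on paragraph boundaries where possible.
function chunkText(text, maxTokens = 1000) {
  const maxChars = maxTokens * 4; // crude ~4 chars/token heuristic
  const paragraphs = text.split(/\n\s*\n/);
  const chunks = [];
  let current = "";
  for (const para of paragraphs) {
    if (current && current.length + para.length + 2 > maxChars) {
      chunks.push(current);
      current = "";
    }
    current = current ? current + "\n\n" + para : para;
    // A single paragraph longer than the budget gets hard-split.
    while (current.length > maxChars) {
      chunks.push(current.slice(0, maxChars));
      current = current.slice(maxChars);
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// Usage: const segments = chunkText(scrape.data.content, 1000);
```

Each returned segment then gets embedded and stored with the title, finalUrl, and metaDescription metadata from the same response.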

The URL to LLM context guide covers this pattern with chunking strategies and metadata schemas in more detail.

Plain text vs markdown: does it matter?

For LLMs, mostly no. A modern model reads plain text fine, and the small amount of structure markdown adds (headings, lists) does not change retrieval quality in practice. If you are feeding content into a prompt, plain text is fine.

For humans and for archives, yes. Polished markdown is easier to read, easier to diff, and easier to convert to other formats later. If you are building a personal read-later archive or a documentation mirror, markdown formatting pulls its weight.

Here is the same idea as a quick comparison.

| Use case | Plain text (CrawlAI) | Markdown (Firecrawl) |
| --- | --- | --- |
| LLM prompt context | Fits naturally | Also fine, slightly more tokens |
| RAG embedding | Ideal, less noise per chunk | Works, headings sometimes help retrieval |
| Human-readable archive | Loses structure | Preserves structure |
| Static site mirror | Awkward | Natural |
| Mixed structured + text output | One call returns both | Two calls or one heavier call |

There is no universal winner. Pick the format that matches the destination.

What about converting plain text to markdown yourself?

If you are mostly happy with CrawlAI's output but want markdown for a specific job, an LLM can reformat plain content into clean markdown in one extra call. Pass the data.content to GPT-5 with a prompt like "reformat the following article as clean markdown with appropriate headings and lists, preserving the original wording", and you get a respectable result.

This is fine for one-off jobs. For systematic markdown pipelines, a tool that produces markdown natively saves the extra round trip.
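The one-off version is a single extra prompt. A sketch, where toMarkdownPrompt is a name invented here and callYourLLM stands in for whatever client you already use:

```javascript
// Build the one-shot reformatting prompt described above.
function toMarkdownPrompt(plainText) {
  return (
    "Reformat the following article as clean markdown with appropriate " +
    "headings and lists, preserving the original wording.\n\n" + plainText
  );
}

// Usage (callYourLLM is a placeholder for your own LLM client):
// const markdown = await callYourLLM(toMarkdownPrompt(scrape.data.content));
```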

Where to go from here

If your job is "I have URLs, I want clean text for an LLM or a RAG index", CrawlAI is a small, predictable answer. One endpoint, three fields, structured envelope, ready to chain into the rest of your pipeline.

If your job leans heavily on the markdown format itself, the Firecrawl alternative page is the honest comparison.

For the broader story of how LLM-driven web extraction changes scraping pipelines, the main AI scraping guide covers the strategic picture. For the structured-extraction sibling of this post, the HTML to JSON guide walks through schema design.

The full API contract, including every response field and language examples in cURL, JavaScript, Python, and PHP, lives in the documentation.

Try CrawlAI

Turn any URL into structured JSON with your own schema, powered by GPT-5. Pay-as-you-go starts at $10.