URL to Markdown: Clean Page Text for LLMs, RAG, and Archives

A URL is not a useful input for a language model. The model wants text. Specifically, it wants the readable body of the page, free of navigation, ads, sidebars, cookie banners, and the other furniture that surrounds the actual content.

Going from a URL to that clean text is the unglamorous prep step behind every RAG pipeline, every "summarise this article" feature, and every long-term web archive. This guide covers how to do it with CrawlAI, what the output actually looks like, and when a different tool is the better choice.

What you actually need

When people say "URL to markdown", they usually mean one of three things.

  1. Clean body text for an LLM prompt. No formatting required, just the words.
  2. Structured markdown with headings and links preserved, for human reading or for a static archive.
  3. Chunked, embedded text for a RAG index. Often closer to plain text than to formatted markdown.

CrawlAI handles cases one and three out of the box. For case two, where the markdown formatting itself matters, a tool that produces markdown natively is a better fit. More on that below.

What CrawlAI returns

CrawlAI's response envelope includes a data.content field. This is the cleaned page text. The service loads the URL in a headless browser, renders JavaScript, strips navigation, scripts, and styles, and returns what is left as readable text.

The output is not formatted markdown. There are no # headings, no [link](url) syntax, no bullet lists. It is the visible content of the page, in reading order, ready to feed to an LLM or chunk into a vector store.

If your downstream system wants plain text, this is exactly the right format. If your downstream system wants polished markdown, you have a small extra step.

A working example

Here is the simplest possible call. One URL in, cleaned content out.

curl -X POST https://crawlai.io/api/scrape/$CRAWLAI_TOKEN \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/blog/post-slug",
    "selector": "main"
  }'

The response.

{
  "success": true,
  "data": {
    "title": "Why we rewrote our pipeline in Rust",
    "finalUrl": "https://example.com/blog/post-slug",
    "statusCode": 200,
    "metaDescription": "A retrospective on a six month migration project.",
    "content": "Why we rewrote our pipeline in Rust. Last spring we decided...",
    "aiAnalysis": null
  },
  "remaining_calls": 999
}

A few things worth noting: content is the cleaned text, finalUrl reflects where the request actually landed after any redirects, and statusCode lets you filter out failed fetches before they reach your pipeline. The full envelope, with every field and every error code, is documented in the docs.
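Before touching data, it pays to check the envelope. A minimal guard, sketched against the example response above (the field names match that sample; extractContent is a helper name invented here, and error shapes beyond success: false are an assumption):

```javascript
// Hypothetical sample envelope, shaped like the example response above.
const response = {
  success: true,
  data: {
    title: "Why we rewrote our pipeline in Rust",
    finalUrl: "https://example.com/blog/post-slug",
    statusCode: 200,
    content: "Why we rewrote our pipeline in Rust. Last spring we decided...",
  },
  remaining_calls: 999,
};

// Pull out the cleaned text, or fail loudly with whatever context exists.
function extractContent(envelope) {
  if (!envelope.success) {
    throw new Error(`Scrape failed: ${JSON.stringify(envelope)}`);
  }
  const { content, title, finalUrl } = envelope.data;
  return { content, title, finalUrl };
}

const page = extractContent(response);
```

The same guard sits naturally in front of every downstream step in this guide.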

When CrawlAI is the right tool

CrawlAI fits well when you already hold the URLs: RAG ingestion, article summarisation, lead enrichment, monitoring known sources.

The single-URL design is the point. There is no /crawl endpoint, no link discovery, no map. If you bring the URL list, CrawlAI returns clean content and optional structured JSON, fast.

When something else is the right tool

Be honest about the alternatives.

If you specifically want markdown-formatted output, where headings remain as headings and links remain as links, Firecrawl ships markdown as a default output format. CrawlAI returns plain text, which is closer to what an LLM prompt wants but further from what a human reader wants. The Firecrawl alternative page covers the head-to-head in detail.

If you need to discover URLs as well as fetch them, neither tool is a one-stop fit unless you use Firecrawl's /crawl endpoint. CrawlAI's design assumes you bring the URLs. That assumption is wrong for some workflows (full-site documentation ingestion, for example) and right for others (lead enrichment, monitored sources, partner feeds).

If you want a self-hosted option, CrawlAI is not it: the service is hosted only. Firecrawl is open source. For an open-source Python route specifically, the Crawl4AI vs CrawlAI post covers that path.

Pick the tool whose default scope matches your problem.

Feeding the output to an LLM

The most common downstream use of clean URL content is a prompt. Here is the pattern in pseudocode.

const scrape = await fetch(`https://crawlai.io/api/scrape/${token}`, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    url: "https://example.com/article",
    selector: "article"
  })
}).then(r => r.json());

const prompt = `
Summarise the following article in three bullet points.

TITLE: ${scrape.data.title}
URL: ${scrape.data.finalUrl}

CONTENT:
${scrape.data.content}
`;

const summary = await callYourLLM(prompt);

That is the entire pipeline. Fetch, format, prompt. The cleaned content is short enough to fit in a modern context window for most pages, and the title and URL give the model the framing metadata it needs.
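For pages that do overflow the window, the format step needs a budget. A hedged sketch, with a crude character count standing in for a real token count (the 4-characters-per-token ratio is a rough heuristic for English prose, and buildPrompt is a name invented here):

```javascript
// Rough heuristic: ~4 characters per token for English prose.
const MAX_CONTENT_TOKENS = 6000;

function buildPrompt({ title, finalUrl, content }) {
  const budget = MAX_CONTENT_TOKENS * 4;
  const trimmed =
    content.length > budget
      ? content.slice(0, budget) + "\n[truncated]"
      : content;
  return [
    "Summarise the following article in three bullet points.",
    "",
    `TITLE: ${title}`,
    `URL: ${finalUrl}`,
    "",
    "CONTENT:",
    trimmed,
  ].join("\n");
}

const prompt = buildPrompt({
  title: "Why we rewrote our pipeline in Rust",
  finalUrl: "https://example.com/blog/post-slug",
  content: "Last spring we decided...",
});
```

A real pipeline would swap the heuristic for a proper tokenizer, but the shape stays the same.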

For a structured version of the same pattern (extracting specific fields instead of summarising), the GPT-5 extraction tutorial walks through schemas for articles, products, and contact pages.

Building a RAG index

The other big use case is feeding many pages into a vector store. The shape is the same as for structured extraction. You hold the URL list, you call the API, you process the result.

A typical loop.

  1. For each URL in your list, call POST /api/scrape/{token} with selector: "main" (or whatever narrows to the article body on the source).
  2. Take data.content, chunk it into 500 to 1500 token segments.
  3. Embed each chunk.
  4. Store the chunks alongside data.title, data.finalUrl, and data.metaDescription as metadata.
  5. At query time, retrieve top-K chunks and pass them to the LLM with the user question.
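Step two is the only part of that loop with any real logic. A minimal chunker, assuming the same rough 4-characters-per-token approximation (a production pipeline would use a real tokenizer and overlapping chunks):

```javascript
// Split cleaned text into segments of roughly `maxTokens` tokens,
// breaking on paragraph boundaries where possible.
function chunkText(text, maxTokens = 1000) {
  const maxChars = maxTokens * 4; // crude ~4 chars/token heuristic
  const paragraphs = text.split(/\n\s*\n/);
  const chunks = [];
  let current = "";
  for (const para of paragraphs) {
    if (current && current.length + para.length + 2 > maxChars) {
      chunks.push(current);
      current = "";
    }
    current = current ? current + "\n\n" + para : para;
    // A single paragraph longer than the budget gets hard-split.
    while (current.length > maxChars) {
      chunks.push(current.slice(0, maxChars));
      current = current.slice(maxChars);
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// Usage: const segments = chunkText(scrape.data.content, 1000);
```

Each returned segment then gets embedded and stored with the title, finalUrl, and metaDescription metadata from the same response.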

The URL to LLM context guide covers this pattern with chunking strategies and metadata schemas in more detail.

Plain text vs markdown: does it matter?

For LLMs, mostly no. A modern model reads plain text fine, and the small amount of structure markdown adds (headings, lists) does not change retrieval quality in practice. If you are feeding content into a prompt, plain text is fine.

For humans and for archives, yes. Polished markdown is easier to read, easier to diff, and easier to convert to other formats later. If you are building a personal read-later archive or a documentation mirror, markdown formatting pulls its weight.

Here is the same idea as a quick comparison.

| Use case | Plain text (CrawlAI) | Markdown (Firecrawl) |
| --- | --- | --- |
| LLM prompt context | Fits naturally | Also fine, slightly more tokens |
| RAG embedding | Ideal, less noise per chunk | Works, headings sometimes help retrieval |
| Human-readable archive | Loses structure | Preserves structure |
| Static site mirror | Awkward | Natural |
| Mixed structured + text output | One call returns both | Two calls or one heavier call |

There is no universal winner. Pick the format that matches the destination.

What about converting plain text to markdown yourself?

If you are mostly happy with CrawlAI's output but want markdown for a specific job, an LLM can reformat plain content into clean markdown in one extra call. Pass the data.content to GPT-5 with a prompt like "reformat the following article as clean markdown with appropriate headings and lists, preserving the original wording", and you get a respectable result.

This is fine for one-off jobs. For systematic markdown pipelines, a tool that produces markdown natively saves the extra round trip.
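The one-off version is a single extra prompt. A sketch, where toMarkdownPrompt is a name invented here and callYourLLM stands in for whatever client you already use:

```javascript
// Build the one-shot reformatting prompt described above.
function toMarkdownPrompt(plainText) {
  return (
    "Reformat the following article as clean markdown with appropriate " +
    "headings and lists, preserving the original wording.\n\n" + plainText
  );
}

// Usage (callYourLLM is a placeholder for your own LLM client):
// const markdown = await callYourLLM(toMarkdownPrompt(scrape.data.content));
```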

Where to go from here

If your job is "I have URLs, I want clean text for an LLM or a RAG index", CrawlAI is a small, predictable answer. One endpoint, three fields, structured envelope, ready to chain into the rest of your pipeline.

If your job leans heavily on the markdown format itself, the Firecrawl alternative page is the honest comparison.

For the broader story of how LLM-driven web extraction changes scraping pipelines, the main AI scraping guide covers the strategic picture. For the structured-extraction sibling of this post, the HTML to JSON guide walks through schema design.

The full API contract, including every response field and language examples in cURL, JavaScript, Python, and PHP, lives in the documentation.

Try CrawlAI

Turn any URL into structured JSON with your own schema, powered by GPT-5. Pay-as-you-go starts at $10.