# Output Formats

Crawl4AI provides multiple output formats to suit different needs, ranging from raw HTML to structured data (via LLM or pattern-based extraction) and versatile markdown outputs.
## Basic Formats

```python
result = await crawler.arun(url="https://example.com")

# Access different formats
raw_html = result.html                    # Original HTML
clean_html = result.cleaned_html          # Sanitized HTML
markdown_v2 = result.markdown_v2          # Detailed markdown generation results
fit_md = result.markdown_v2.fit_markdown  # Most relevant content in markdown
```
> **Note**: The `markdown_v2` property will soon be replaced by `markdown`. It is recommended to start transitioning to `markdown` for new implementations.
## Raw HTML

Original, unmodified HTML from the webpage. Useful when you need to:

- Preserve the exact page structure.
- Process HTML with your own tools.
- Debug page issues.

```python
result = await crawler.arun(url="https://example.com")
print(result.html)  # Complete HTML including headers, scripts, etc.
```
## Cleaned HTML

Sanitized HTML with unnecessary elements removed. Automatically:

- Removes scripts and styles.
- Cleans up formatting.
- Preserves semantic structure.

```python
from crawl4ai import CrawlerRunConfig

config = CrawlerRunConfig(
    excluded_tags=['form', 'header', 'footer'],  # Additional tags to remove
    keep_data_attributes=False                   # Remove data-* attributes
)
result = await crawler.arun(url="https://example.com", config=config)
print(result.cleaned_html)
```
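To see what tag exclusion does conceptually, here is a deliberately simplified sketch in plain Python. The `clean_html` helper below is hypothetical and illustrative only; the library itself sanitizes a parsed DOM rather than applying regexes:

```python
import re

def clean_html(html: str, excluded_tags=("script", "style")) -> str:
    # Drop each excluded tag together with its contents (simplified;
    # real sanitizers walk the DOM instead of using regexes).
    for tag in excluded_tags:
        html = re.sub(rf"<{tag}\b.*?</{tag}>", "", html, flags=re.S | re.I)
    return html

page = "<div>Hello</div><script>alert(1)</script><footer>legal</footer>"
print(clean_html(page, excluded_tags=("script", "footer")))
# -> <div>Hello</div>
```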
## Standard Markdown

HTML converted to clean markdown format. This output is useful for:

- Content analysis.
- Documentation.
- Readability.

```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(
        options={"include_links": True}  # Include links in markdown
    )
)
result = await crawler.arun(url="https://example.com", config=config)
print(result.markdown_v2.raw_markdown)  # Standard markdown with links
```
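What `include_links` controls can be illustrated with a minimal converter. The `links_to_markdown` function below is a hypothetical sketch for intuition, not the library's implementation:

```python
import re

def links_to_markdown(html: str, include_links: bool = True) -> str:
    # Turn <a href="url">text</a> into [text](url), or bare text
    # when links are excluded.
    def repl(m):
        url, text = m.group(1), m.group(2)
        return f"[{text}]({url})" if include_links else text
    return re.sub(r'<a\s+href="([^"]+)"[^>]*>(.*?)</a>', repl, html)

html = 'See <a href="https://example.com">the docs</a> for details.'
print(links_to_markdown(html, include_links=True))
# -> See [the docs](https://example.com) for details.
```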
## Fit Markdown

Extract and convert only the most relevant content into markdown format. Best suited for:

- Article extraction.
- Focusing on the main content.
- Removing boilerplate.

To generate `fit_markdown`, use a content filter such as `PruningContentFilter`:

```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter

config = CrawlerRunConfig(
    content_filter=PruningContentFilter(
        threshold=0.7,
        threshold_type="dynamic",
        min_word_threshold=100
    )
)
result = await crawler.arun(url="https://example.com", config=config)
print(result.markdown_v2.fit_markdown)  # Extracted main content in markdown
```
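The intuition behind `min_word_threshold` can be sketched as follows. The `prune_blocks` function is a toy stand-in for illustration only; the actual `PruningContentFilter` scoring is more sophisticated than a bare word count:

```python
def prune_blocks(blocks, min_word_threshold=100):
    # Keep only blocks whose word count clears the threshold --
    # a toy stand-in for PruningContentFilter's relevance scoring.
    return [b for b in blocks if len(b.split()) >= min_word_threshold]

blocks = [
    "Home",                                  # nav item
    "Cookies help us deliver our services",  # banner
    "The article body usually contains long, information-dense paragraphs "
    "with many more words than navigation chrome or cookie banners.",
]
print(prune_blocks(blocks, min_word_threshold=10))  # keeps only the long block
```

Short navigation and banner fragments fall below the threshold, so only substantive text survives into `fit_markdown`.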
## Markdown with Citations

Generate markdown that includes citations for links. This format is ideal for:

- Creating structured documentation.
- Including references for extracted content.

```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(
        options={"citations": True}  # Enable citations
    )
)
result = await crawler.arun(url="https://example.com", config=config)
print(result.markdown_v2.markdown_with_citations)
print(result.markdown_v2.references_markdown)  # Citations section
```
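The citation transformation can be sketched in a few lines. The `add_citations` function below is an illustrative assumption, and the exact marker style produced by `DefaultMarkdownGenerator` may differ:

```python
import re

def add_citations(markdown: str):
    # Replace inline links with numbered markers and collect a
    # references section (illustrative only).
    refs = []
    def repl(m):
        text, url = m.group(1), m.group(2)
        if url not in refs:
            refs.append(url)
        return f"{text}[{refs.index(url) + 1}]"
    body = re.sub(r"\[([^\]]+)\]\(([^)]+)\)", repl, markdown)
    references = "\n".join(f"[{i + 1}]: {u}" for i, u in enumerate(refs))
    return body, references

body, refs = add_citations("See [docs](https://example.com) and [more](https://example.org).")
print(body)  # -> See docs[1] and more[2].
print(refs)  # -> [1]: https://example.com
             #    [2]: https://example.org
```

Repeated links to the same URL reuse the same citation number, so the references section stays deduplicated.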